If you look at SharpCrafters, the front page has a cool component that shows how popular certain search terms are, with larger text for more popular terms. I've seen this around the web in different places, especially blogs. What is this called in general, and what specific implementations exist?
That would be the Tag Cloud.
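As an aside, the core of most tag cloud implementations is just a mapping from term popularity to font size, usually on a logarithmic scale so a few very popular terms don't dwarf everything else. A rough sketch in Python (the counts and pixel range are made-up values, purely for illustration):

    import math

    def tag_cloud_sizes(counts, min_px=12, max_px=36):
        """Map raw tag counts to font sizes on a log scale."""
        lo = math.log(min(counts.values()))
        hi = math.log(max(counts.values()))
        span = (hi - lo) or 1.0  # all counts equal -> avoid division by zero
        return {
            tag: round(min_px + (math.log(n) - lo) / span * (max_px - min_px))
            for tag, n in counts.items()
        }

    # Made-up search-term counts, purely for illustration
    print(tag_cloud_sizes({"aop": 120, "caching": 45, "logging": 300, "threading": 8}))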
I am interested in implementing a search feature on a website. It is a location search, so address, state, and zip should all work. It should then show results in that area and allow them to be filtered.
My question is:
What's the best approach for something like this?
There are literally dozens of ways of doing this (if not more). The exact implementation would depend on the technology stack that you use, but as a very top level overview:
you'd need to store the things you are searching for somewhere, and tag them with a lat/long location. Often, this would be in a database of some kind.
using a programming language, you would need to write a search that accepts a postcode, translates that to a lat/long, and then searches the things in your database based on the distance between the location of each thing and the location entered in the search (a rough sketch of this step follows the list below).
if you want to support filtering, your search would need to support that too. This is often called "faceting" the search.
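To make the distance step concrete, here is a rough sketch in Python, assuming your records are simple dicts already tagged with lat/long and a category field (all names here are illustrative, not a specific schema):

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two lat/long points, in kilometres."""
        lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
        dlat, dlon = lat2 - lat1, lon2 - lon1
        a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
        return 6371.0 * 2 * math.asin(math.sqrt(a))

    def radius_search(records, lat, lon, radius_km, category=None):
        """Records within radius_km of (lat, lon), nearest first, optionally filtered by category."""
        hits = []
        for r in records:
            if category is not None and r.get("category") != category:
                continue  # crude facet-style filter
            d = haversine_km(lat, lon, r["lat"], r["lon"])
            if d <= radius_km:
                hits.append((d, r))
        hits.sort(key=lambda pair: pair[0])
        return [r for _, r in hits]

This scales fine for small data sets; beyond that you would want a spatial index or one of the search engines mentioned below rather than scanning every record.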
Working out the lat/long locations will need to be done using a geocoding service. There are some, such as PostCode Anywhere, that will do this as a paid service, and others that are free (within reason), such as the Google Maps APIs.
There are probably some hosted services that will do what you want, you'd have to shop around.
Examples of search software that supports geolocation searching out of the box include Solr, Azure Search, Lucene and Elasticsearch.
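For instance, Elasticsearch can do the radius search for you if the location field is mapped as a geo_point. A sketch of what that query body looks like (the index and field names are assumptions, not anything specific to your site):

    # Query body using Elasticsearch's geo_distance filter; this assumes an index
    # whose "location" field is mapped as geo_point (index and field names are
    # illustrative, not a specific recommendation).
    geo_query = {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": "20km",
                        "location": {"lat": 51.5074, "lon": -0.1278},
                    }
                }
            }
        }
    }

    # With the official Python client this would be roughly:
    #   from elasticsearch import Elasticsearch
    #   es = Elasticsearch("http://localhost:9200")
    #   results = es.search(index="places", body=geo_query)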
Google is not using the meta-keywords tag at all, because keywords are mostly used to spam search engines.
Google is not using the meta-description tag for ranking. Sometimes the meta-description tag is used for the snippet shown in search results, but mostly the snippet is generated automatically from the content of the page, typically from its beginning.
Google has dropped support for the meta-keywords and meta-description tags for search ranking, and Google handles about 92% of all search queries in the world. So it seems web developers can stop using the meta-keywords and meta-description tags, because spending time on them is not worth it.
Is there any real benefit for using meta-keywords and meta-description tags?
Links:
Google Webmasters Blog about meta tags support by Google;
Video with Matt Cutts about meta tags support by Google;
StatCounter search engine usage stats - Google handles about 92% of all search queries in the world.
No, we should carry on using meta tags, because we aren't, and shouldn't be, just supporting Google. The approach should be: make documents as indexable as possible using a search-engine-agnostic approach, and then put in special handling for one or two top engines - such as using Google's online tools to improve search ranking.
Google are very dominant in search at present, but there's no guarantee they will always be on top. Maybe it will be Facebook in the future, or perhaps Yahoo/Bing etc. will dramatically improve search quality, and people will switch back.
Side note: for search, I really like DuckDuckGo at the moment. Lots of nice search shortcuts (see bang operators) and a meaningful privacy policy.
We should use them because they are there. Who knows - perhaps they will be useful again in the future?
When the W3C drop them we can stop using them.
Just my opinion ofc...
keywords:
Google is not the only search engine. Google market share depends on many factors (country, age, technical know-how, …). Small percentages of big numbers are also big numbers.
There are special purpose search engines (for niches; only crawling hand-selected sites; etc.) that might still consider it.
Local search engines might use it, and (local) full-text search engines often do anyway.
Some CMS use it for site search.
There are consuming user-agents other than search engines, e.g. parsers and extractors.
description:
It can be useful even for Google: when someone searches only for the title/domain of your site, Google would often display snippets like "Login / Register … back to top … please insert CAPTCHA …". If a description is provided, it can be used instead.
(The points mentioned under keywords are relevant for description, too.)
If Google SEO is your only concern then meta keywords are a complete waste of time, but if you're targeting other search engines it may be worth investigating.
I believe Baidu still uses meta keywords, and that search engine is the dominant player in the Chinese market, so it'd be worth adding meta keywords if you want your site to be popular in China.
Regardless, I wouldn't go stuffing excessive numbers of irrelevant keywords in, as there is every chance that whatever search engine you're targeting will penalise you. 5-7 words summarising your page content is a good starting point.
I'm trying to build a local version of the Freebase search API using their quad dumps. I'm wondering what algorithm they use to match names? As an example, if you go to freebase.com and type in "Hiking" you get:
"Apo Hiking Society"
"Hiking"
"Hiking Georgia"
"Hiking Virginia's national forests"
"Hiking trail"
Wow, a lot of guesses! I hope I don't muddy the waters too much by not guessing too.
The auto-complete box is basically powered by Freebase Suggest which is powered, in turn, by the Freebase Search service. Strings which are indexed by the search service for matching include: 1) the name, 2) all aliases in the given language, 3) link anchor text from the associated Wikipedia articles and 4) identifiers (called keys by Freebase), which includes things like Wikipedia article titles (and redirects).
How the various things are weighted/boosted hasn't been disclosed, but you can get a feel for things by playing with it for a while. As you can see from the API, there's also the ability to do filtering/weighting by types and other criteria, and this can come into play depending on the context. For example, if you're adding a record label to an album, topics which are typed as record labels will get a boost relative to things which aren't (but you can still get to things of other types, to allow for the use case where your target topic hasn't had the appropriate type applied yet).
So that gives you a little insight into how their service works, but why not build a search service that does what you need since you're starting from scratch anyway?
BTW, pre-Google the Metaweb search implementation was built on top of Lucene, so you could definitely do worse than using that as your starting point. You can read some of the details in the mailing list archive.
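To illustrate the kind of multi-field, boosted matching described above, here is a tiny pure-Python sketch. The field list mirrors the four string sources mentioned earlier, but the boost values are made up; Freebase's actual weights were never published:

    # Hypothetical field boosts; Freebase's real weights are not public.
    FIELD_BOOSTS = {"name": 10.0, "aliases": 5.0, "anchor_text": 2.0, "keys": 1.0}

    def score_topic(topic, query):
        """Score a topic by which indexed fields contain the query as a word prefix."""
        q = query.lower()
        score = 0.0
        for field, boost in FIELD_BOOSTS.items():
            for value in topic.get(field, []):
                if any(word.startswith(q) for word in value.lower().split()):
                    score += boost
                    break  # count each field at most once
        return score

    def suggest(topics, query, limit=5):
        """Rank topics by boosted score and keep the best non-zero matches."""
        scored = [(score_topic(t, query), t) for t in topics]
        scored = [(s, t) for s, t in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [t for _, t in scored[:limit]]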
Probably they use an inverted index over selected fields, such as the English name, aliases, and the Wikipedia snippet displayed. In your application you can achieve that using something like Lucene.
For the algorithm side, I find the following paper a good overview:
Zobel and Moffat (2006): "Inverted Files for Text Search Engines".
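A minimal sketch of the inverted-index idea in Python, indexing a couple of fields per entity (field names and structure are illustrative only):

    from collections import defaultdict

    def build_inverted_index(docs, fields=("name", "aliases")):
        """Map each token to the set of doc ids whose selected fields contain it."""
        index = defaultdict(set)
        for doc_id, doc in docs.items():
            for field in fields:
                values = doc.get(field, [])
                if isinstance(values, str):
                    values = [values]
                for value in values:
                    for token in value.lower().split():
                        index[token].add(doc_id)
        return index

    def search(index, query):
        """Doc ids containing every token of the query (AND semantics)."""
        token_sets = [index.get(token, set()) for token in query.lower().split()]
        return set.intersection(*token_sets) if token_sets else set()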
Most likely it's a trie with lexicographical order.
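If it were a trie, prefix completion would look roughly like this sketch (purely illustrative, not Freebase's actual code):

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.is_word = False

    def insert(root, word):
        """Add a word to the trie, character by character."""
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def complete(root, prefix):
        """Yield every stored word starting with prefix, in lexicographic order."""
        node = root
        for ch in prefix:
            if ch not in node.children:
                return
            node = node.children[ch]
        stack = [(node, prefix)]
        while stack:
            node, word = stack.pop()
            if node.is_word:
                yield word
            # Push children in reverse order so the smallest letter is expanded first.
            for ch in sorted(node.children, reverse=True):
                stack.append((node.children[ch], word + ch))

    root = TrieNode()
    for name in ["hiking", "hiking trail", "hiking georgia"]:
        insert(root, name)
    print(list(complete(root, "hiking")))  # ['hiking', 'hiking georgia', 'hiking trail']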
There are a number of algorithms available: Boyer-Moore, Smith-Waterman-Gotoh, Knuth-Morris-Pratt, etc. You might also want to read up on edit distance algorithms such as Levenshtein. You will need to play around to see which best suits your purpose.
An implementation of such algorithms is the Simmetrics library by the University of Sheffield.
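For reference, the Levenshtein edit distance mentioned above is a short dynamic-programming routine; a minimal sketch:

    def levenshtein(a, b):
        """Minimum number of single-character edits (insert, delete, substitute) turning a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[len(b)]

    assert levenshtein("hiking", "hikking") == 1
    assert levenshtein("hiking", "biking") == 1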
I need to develop a vertical search engine as part of a website. The data for the search engine comes from websites of a specific category. I guess for this I need to have a crawler that crawls several (a few hundred) sites in a specific business category and extracts the content and URLs of products and services. Other types of pages may be irrelevant. Most of the sites are tiny or small (a few hundred pages at most). The products have 10 to 30 attributes.
Any ideas on how to write such a crawler and extractor? I have written a few crawlers and content extractors using the usual Ruby libraries, but not a full-fledged search engine. I guess the crawler, from time to time, wakes up and downloads the pages from the websites. Usual polite behaviour like checking robots exclusion rules will be followed, of course. Meanwhile, the content extractor can update the database as it reads the pages. How do I synchronize the crawler and the extractor? How tightly should they be integrated?
Nutch builds on Lucene and already implements a crawler and several document parsers.
You can also hook it to Hadoop for scalability.
In the enterprise-search context that I am used to working in,
crawlers,
content extractors,
search engine indexes (and the loading of your content into these indexes),
being able to query that data efficiently and with a wide range of search operators,
programmatic interfaces to all of these layers,
optionally, user-facing GUIs
are all separate topics.
(For example, while extracting useful information from an HTML page vs. a PDF vs. an MS Word file is conceptually similar, the actual programming for these tasks is still very much a work in progress for any general solution.)
You might want to look at the Lucene suite of open-source tools, understand how those fit together, and possibly decide that it would be better to learn how to use those tools (or others like them) than to reinvent the very big, complicated wheel.
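As a very rough illustration of how loosely the crawler and extractor can be coupled, here is a sketch where the crawler only pushes raw pages onto a queue and the extractor consumes from it. It's in Python for brevity (the same shape works in Ruby), and every name and callback here is a made-up placeholder, not a recommendation of a specific framework:

    import queue
    import threading

    page_queue = queue.Queue()   # crawler -> extractor hand-off
    SENTINEL = object()          # signals "crawl finished"

    def crawler(urls, fetch):
        """Download pages politely and hand the raw HTML to the extractor via the queue."""
        for url in urls:
            html = fetch(url)            # fetch() would check robots.txt, rate-limit, etc.
            page_queue.put((url, html))
        page_queue.put(SENTINEL)

    def extractor(parse, store):
        """Consume raw pages, extract product attributes, and update the database."""
        while True:
            item = page_queue.get()
            if item is SENTINEL:
                break
            url, html = item
            record = parse(html)         # pull out the 10-30 product attributes
            store(url, record)           # upsert into the database / search index

    # The two stages share only the queue, so they can run on different schedules,
    # threads, or machines (swap the in-process queue for a message broker if needed).
    t1 = threading.Thread(target=crawler,
                          args=(["http://example.com/p/1"], lambda u: "<html>...</html>"))
    t2 = threading.Thread(target=extractor,
                          args=(lambda h: {"title": "stub"}, lambda u, r: print(u, r)))
    t2.start(); t1.start(); t1.join(); t2.join()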
I believe in books, so thanks to your query, I have discovered this book and have just ordered it. It looks like a good take on one possible solution to the search-tool conundrum.
http://www.amazon.com/Building-Search-Applications-Lucene-LingPipe/product-reviews/0615204252/ref=cm_cr_pr_hist_5?ie=UTF8&showViewpoints=0&filterBy=addFiveStar
Good luck and let us know what you find out and the approach you decide to take.
I'm wondering if there are any good .NET recommendation algorithms available in open source projects, whether attached to a search engine or not. By recommendation I mean something that accepts a full-text article and recommends other articles from its index based on keyword similarity.
At the high end there are document classification engines like Autonomy; at the low end, spam filters and blog "related posts" widgets. Possibly advertisement-to-article matching, too. I'd like to incorporate one into a project but can't afford the high end, and the low end seems to all be LAMP-based.
[Sorry, one answer asked for clarification: What I'm looking for is ideally a standalone library, but I'm willing to adapt good source code as necessary. The end result is that I need to be able to create a C# service that accepts an arbitrary amount of text and returns a list of similar previously-indexed articles. Basically, the exact thing that StackOverflow itself does as you are submitting a question!]
Thanks!
Steve
I think that on StackOverflow they strip all common English words from the text and then compare the remaining words with those of other posts to get the "Related" posts.
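That general approach (drop the common words, then compare what's left) is essentially TF-IDF similarity. A minimal sketch in Python with scikit-learn, purely to illustrate the algorithm; the asker wants .NET, so treat this as pseudocode for whatever library ends up being used:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    indexed_articles = [
        "Building a search engine with Lucene",
        "Spam filtering with naive Bayes",
        "Faceted search and filtering in Solr",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")   # drops the common English words
    doc_matrix = vectorizer.fit_transform(indexed_articles)

    def related(new_text, top_n=2):
        """Return the previously indexed articles most similar to new_text."""
        query_vec = vectorizer.transform([new_text])
        scores = cosine_similarity(query_vec, doc_matrix).ravel()
        ranked = scores.argsort()[::-1][:top_n]
        return [(indexed_articles[i], float(scores[i])) for i in ranked]

    print(related("How do I add faceted filtering to my Lucene search?"))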
The question is not very clear (algorithm or library?), but the only thing that comes to mind is Lucene.NET, the port of the popular Lucene library to the .NET Framework. HTH.