Search engine comparison tool. Google and Bing - web

I am trying to build a search engine comparison tool for Bing and Google that analyzes which of the top n results match. I don't have much web development experience (most of my experience is in Windows application development and lower-level work), so I was wondering if somebody could point me in the right direction. I'm guessing that one way of doing this would be to download the search results, find all of the links that are results, and then compare them.
What language can I use to do this?

You could use a language of your choice and build upon APIs. Bing already has one.
Although Google doesn't have a direct search API (at least none that I know of), if you are a student planning to do some research, you can sign up for their university program and they'll give you access to an API. Downloading the results page and parsing it would be difficult, since Google uses security measures to block direct crawls.
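Once the two result lists are available (from whichever API you end up using), the comparison step itself is small. Below is a minimal Python sketch of it, not a definitive implementation. The function names normalize and top_n_overlap are hypothetical, and the normalization rules (ignore the scheme, a leading "www." and a trailing slash) are an assumption about what should count as "the same result".

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to a comparable key (assumed rules, adjust as needed)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def top_n_overlap(results_a, results_b, n=10):
    """Return the normalized URLs that appear in the top n of both lists,
    in the order they appear in results_a."""
    top_a = [normalize(u) for u in results_a[:n]]
    top_b = {normalize(u) for u in results_b[:n]}
    return [u for u in top_a if u in top_b]


if __name__ == "__main__":
    # Hypothetical result lists standing in for real API responses.
    google = ["https://www.example.com/a/", "http://example.org/b", "https://example.net/c"]
    bing = ["http://example.com/a", "https://example.net/c", "https://other.example/d"]
    shared = top_n_overlap(google, bing, n=3)
    print(f"{len(shared)} of the top 3 results match: {shared}")
```

Keeping the normalization in one function makes it easy to tighten or loosen what counts as a match without touching the comparison.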

Related

What are the pros & cons of using Google CSE vs implementing dedicated search engine for a site like stackoverflow?

I understand that work should not be repeated when Google CSE already exists, so what might be the reasons to consider implementing a dedicated search engine for a public-facing website similar to SO (and why did StackOverflow probably do that)? The paid version of CSE (Google Site Search) already eliminates several drawbacks that used to force a dedicated implementation. Cost may be one reason not to choose Google CSE, but what are the other reasons?
Another thing I want to ask: my site is similar to StackOverflow, so when Google indexes its content every now and then, won't that overload my database servers with queries, perhaps during peak traffic?
I look forward to using the Google Custom Search API, but I need to clarify whether the 1000 paid queries that I get for $5 are valid for only one day, or whether unused ones carry over to cover extra queries (beyond the free ones) on the following days. Can anyone clarify this too?
This depends on the content of your site, the frequency of the updates, and the kind of search you want to provide.
For example, with StackOverflow, there'd probably be no way to search for questions of an individual user through Google, but it can be done with an internal search engine easily.
Similarly, Google can retire their API at any time. In fact, Google has already done so with their Google Web Search API, and a lot of non-profits with projects based on that API were left with no Google option for continuing their services (paying 100 USD/year for only 20'000 search queries per year may be fine for a posh blog, but it greatly limits what you can actually use the search API for).
On the other hand, you probably already want to have Google index all of your pages, to get the organic search traffic, so Google CSE would probably use rather minimal resources of your server, compared to having a complete in-house search engine.
Now that Google Site Search is gone, the best alternative search tool for loyal Google fans is Google Custom Search (CSE).
Some of the features of Google Custom Search that I loved the most were:
It's free (with ads)
Ability to monetise those ads with your AdSense account
Tons of customization options, including removing the Google branding
Ability to link it with a Google Analytics account for highly comprehensive analytical reports
Powerful autocorrect feature that understands the real intention behind typos
Cons: Lacks customer support…
Read More: https://www.techrbun.com/2019/05/google-custom-search-features.html

How do I get started with information extraction?

I am a newbie when it comes to information extraction. For the past several days, I have read a lot of academic papers and ordered a book on NLP. I want to figure out how I can build a FlipDog.com like system (hopefully not from scratch). They extract job openings from more than 60,000 company web sites. How do I get started?
I am open to learning any programming language. Has anybody used Mallet/GATE/MinorThird or RoadRunner? Ideally, I want to be able to train a system with the data set particular to my domain and have it extract information based on that. Which platform would you recommend for this purpose?
Thanks!
The fastest way to extract job offerings is to use dapper.net (a web scraping service). You can very easily teach Dapper to extract data using its visual editor. It works very well when your target websites present the data in tables.
To learn Information Extraction, I suggest starting with LingPipe. It is a Java framework for Information Extraction, so you do not need to learn the architecture-specific features of a framework such as GATE or Apache UIMA. On the LingPipe website you will find a lot of tutorials that will help you learn various Information Extraction approaches. After that, I suggest learning GATE and UIMA.
If you want to build such a website, you also need to learn how to use web crawler frameworks (e.g., Nutch), web search engines (Yahoo, Google, Bing), and Information Retrieval engines (such as Apache Lucene) to provide a search service on top of the extracted data.
Update:
For Python, it is best to start with NLTK: http://www.nltk.org/
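To give a feel for what an Information Retrieval engine does with the extracted data, here is a toy inverted index in Python. It is only a sketch of the core idea and is not how Lucene is implemented or used. The function names and the sample job postings are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


if __name__ == "__main__":
    # Hypothetical extracted job postings keyed by an id.
    postings = {
        1: "Senior Java developer, Boston",
        2: "Java tester in Austin",
        3: "Python developer",
    }
    index = build_index(postings)
    print(search(index, "java developer"))
```

A real engine adds ranking, stemming, and persistent storage on top of this structure, which is why the answer above recommends an existing library rather than rolling your own.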

Embed Google/ Yahoo search into a web site or build your own

I am looking for an opinion on whether to use Google Custom Search, Yahoo Search Builder, or build my own for web projects (no more than 100 pages of content). If I should build my own, do you have any quick-start kits you could recommend?
Many thanks
Chris
I have had success using OpenSearch for my personal blog.
While working at BigCorp we used dedicated search appliances in yellow boxes, but in your case (around 100 pages) it does not make sense to take such a route.
I would suggest going with either Google Custom Search, or Yahoo Search Builder (as long as they both index your site sufficiently to provide good results).
More often than not, you'll get better results and you don't have to worry about building your own custom engine (or implementing an off the shelf/open source piece of software to do the job for you).
I've used IBM OmniFind Yahoo Edition and had fantastic results with it. You are limited to a single index per implementation, but it's very fast, easy to integrate with, and extensible in terms of search customization. I've used it with an ASP.NET site without issue. One caveat is that it needs to be installed on the server and run as a service, so it is out of the question for most shared hosting. It has the indexing capabilities of general search engines (PDF, HTML, etc.), which is very nice.
Edit:
I forgot to mention that some of the reasons I liked it over other options are that it is free and doesn't require any additional hardware, just FYI.
The main situation I see Google/Yahoo as being sub-optimal is when your site relies on up-to-the-minute results. You're at the mercy of their crawling policies/speed/etc. If that's okay (and I suspect it will be for most 100ish page sites), use them - the results will be great. If realtime results are important, you may have to bite the bullet and install something locally.
Yahoo BOSS is cheaper and recommended by many people.
I am going to integrate it soon.

Getting Started with Google Programming

I'm just beginning a project that involves working with a few of Google's APIs (for .NET), specifically the Contacts List, Calendar and Gmail. While Google does provide a wealth of information through their code.google.com network, finding what I need to get started has thus far proven to be a monumental task. What I'm hoping to find is a "big picture" look at what Google makes available to developers as well as a few sample pieces of code to ease me off the starting block.
Does anyone know where I can find a handful of simple, useful examples developing basic applications with Google's API (I've come across a few examples within code.google.com, but they're so rudimentary that they're not helpful)? Are there any resources (in print or online) that can spoon feed a Google novice without burying them? Is there a special, hidden nook within code.google.com for Google beginners?
Any information anyone could share would be very helpful.
http://code.google.com/apis/ajax/playground/

search integration

I am working on a website that currently has a number of disparate search functions, for example:
A crawl 'through the front door' of the website
A search that communicates with a web-service
etc...
What would be the best way to tie these together, and provide what appears to be a unified search function?
I found the following list on wikipedia
Free and open source enterprise search software
Lucene and Solr
Xapian
Vendors of proprietary enterprise search software
AskMeNow
Autonomy Corporation
Concept Searching Limited
Coveo
Dieselpoint, Inc.
dtSearch Corp.
Endeca Technologies Inc.
Exalead
Expert System S.p.A.
Funnelback
Google Search Appliance
IBM
ISYS Search Software
Microsoft (includes Microsoft Search Server, Fast Search & Transfer)
Open Text Corporation
Oracle Corporation
Queplix Universal Search Appliance
SAP
TeraText
Vivísimo
X1 Technologies, Inc.
ZyLAB Technologies
Thanks for any advice regarding this.
Solr is an unbelievably flexible solution for search. Just in the last year I coded two Solr-based websites and worked on a third existing one, and each worked in a very different way.
Solr simply eats XML requests to add something to the index, and XML requests to search for something inside an index. It doesn't do crawling or text extraction for you, but most of the time these are easy to do. There are many existing add-ons for the Solr/Lucene stack, so maybe something for you already exists.
I would avoid proprietary software unless you're sure Solr is insufficient. It's one of the nicest programs I've worked with, very flexible when you need it and at the same time you can start in minutes without reading long manuals.
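As a rough illustration of the XML "add" request described above, the Python sketch below builds an update message in Solr's add/doc/field layout. It only constructs the XML string and does not send it anywhere. The helper name build_add_message and the field names "id" and "title" are assumptions for the example, since real field names depend on your Solr schema.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Build a Solr-style <add> message from a list of field dictionaries."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > for us
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical documents; the field names must match your own schema.
    message = build_add_message([{"id": "1", "title": "Search & integration"}])
    print(message)
```

Building the message with an XML library rather than string concatenation avoids broken requests when the content contains characters such as & or <.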
Note that no matter what search solution you use, a search setup is "disparate" by nature.
You will still have an indexer, and a search UI, or the "framework".
You WILL corner yourself by marrying a specific search technology. You actually want to have the UI as separate from the search backend as possible. The backend may stop scaling, or there may be a better search engine out there tomorrow.
Switching search engines is very common, so never - ever - write your interface with a specific search engine in mind. Always abstract it, so the UI is not aware of the actual search technology used.
Keep it modular, and you will thank yourself later.
By using a standard web services interface, you can also allow 3rd parties to build stuff for you, and they won't have to "learn" whatever search engine you use on the backend.
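One way to picture this separation is the Python sketch below. Every name in it (Hit, SearchBackend, InMemoryBackend, CombinedBackend, render_results) is hypothetical, and the in-memory backend is only a stand-in for a real engine such as Solr. The UI code talks to a small interface, and a combining backend merges several sources so the site appears to have a single search function.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Hit:
    url: str
    title: str


class SearchBackend(Protocol):
    """The only thing the UI is allowed to know about a search engine."""

    def search(self, query: str, limit: int) -> list[Hit]: ...


class InMemoryBackend:
    """Stand-in backend: substring match over a dict of url -> title."""

    def __init__(self, pages: dict[str, str]):
        self.pages = pages

    def search(self, query: str, limit: int) -> list[Hit]:
        q = query.lower()
        hits = [Hit(url, title) for url, title in self.pages.items() if q in title.lower()]
        return hits[:limit]


class CombinedBackend:
    """Queries several backends in order and drops duplicate URLs."""

    def __init__(self, backends: list[SearchBackend]):
        self.backends = backends

    def search(self, query: str, limit: int) -> list[Hit]:
        seen, merged = set(), []
        for backend in self.backends:
            for hit in backend.search(query, limit):
                if hit.url not in seen:
                    seen.add(hit.url)
                    merged.append(hit)
        return merged[:limit]


def render_results(backend: SearchBackend, query: str, limit: int = 10) -> list[str]:
    """UI-side code: it never refers to a concrete engine."""
    return [f"{hit.title} ({hit.url})" for hit in backend.search(query, limit)]


if __name__ == "__main__":
    crawl = InMemoryBackend({"/docs/install": "Install guide", "/faq": "FAQ"})
    service = InMemoryBackend({"/docs/install": "Install guide", "/kb/42": "Guide to search"})
    for line in render_results(CombinedBackend([crawl, service]), "guide"):
        print(line)
```

Swapping the engine later then means writing one new class with a search method, while render_results and the rest of the UI stay untouched.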
Take a look at these similar questions:
Best text search engine for integrating with custom web app?
How do I implement Search Functionality in a website?
My personal recommendation: Solr.
All these companies offer different features of Universal Search. Smaller companies carved themselves very functional and extremely desired niches. For example Queplix enables any search engine to work with structured data and enterprise applications by extracting the data, business objects, roles and permissions from all indexed applications. It provides enterprise-ranking criteria as well as data-compliance alerts.
Two other solutions that weren't as well known and/or available around the time the original question was asked:
Google Custom Search - especially since the option to disable the public URL was recently added
YaCy - you can join the network or download and roll your own independent servers

Resources