search integration - search

I am working on a website that currently has a number of disparate search functions, for example:
A crawl 'through the front door' of the website
A search that communicates with a web-service
etc...
What would be the best way to tie these together, and provide what appears to be a unified search function?
I found the following list on Wikipedia:
Free and open source enterprise search software
Lucene and Solr
Xapian
Vendors of proprietary enterprise search software
AskMeNow
Autonomy Corporation
Concept Searching Limited
Coveo
Dieselpoint, Inc.
dtSearch Corp.
Endeca Technologies Inc.
Exalead
Expert System S.p.A.
Funnelback
Google Search Appliance
IBM
ISYS Search Software
Microsoft (includes Microsoft Search Server, Fast Search & Transfer)
Open Text Corporation
Oracle Corporation
Queplix Universal Search Appliance
SAP
TeraText
Vivísimo
X1 Technologies, Inc.
ZyLAB Technologies
Thanks for any advice regarding this.

Solr is an unbelievably flexible solution for search. In the last year alone I coded two Solr-based websites and worked on a third, existing one; each worked in a very different way.
Solr simply eats XML requests to add something to an index, and answers HTTP search requests against that index with XML responses. It doesn't do crawling or text extraction for you, but most of the time these are easy to do. There are many existing add-ons for the Solr/Lucene stack, so maybe something that fits your case already exists.
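To give a feel for the "XML request to add something" part, here is a minimal sketch in Python that builds a Solr XML update message. The document and its field names (`id`, `title`) are made up for illustration; real field names have to match your Solr schema.

```python
import xml.etree.ElementTree as ET


def build_add_request(doc: dict) -> str:
    """Serialize one document as a Solr XML <add> update message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # Each field becomes <field name="...">value</field>
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; the field names are assumptions, not a required schema.
request_body = build_add_request({"id": "page-1", "title": "Hello"})
print(request_body)
```

The resulting string would typically be POSTed to the update handler of your Solr core with an XML content type, followed by a commit. The exact URL depends on your installation, so it is left out here.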
I would avoid proprietary software unless you're sure Solr is insufficient. It's one of the nicest programs I've worked with, very flexible when you need it and at the same time you can start in minutes without reading long manuals.

Note that no matter what search solution you use, a search setup is "disparate" by nature.
You will still have an indexer, and a search UI, or the "framework".
You WILL corner yourself by marrying a specific search technology. You actually want to have the UI as separate from the search backend as possible. The backend may stop scaling, or there may be a better search engine out there tomorrow.
Switching search engines is very common, so never - ever - write your interface with a specific search engine in mind. Always abstract it, so the UI is not aware of the actual search technology used.
Keep it modular, and you will thank yourself later.
By using a standard web services interface, you can also allow 3rd parties to build stuff for you, and they won't have to "learn" whatever search engine you use on the backend.
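As a rough illustration of the abstraction described above, here is a minimal sketch in Python. All names (`Hit`, `SearchBackend`, `StaticBackend`, `UnifiedSearch`) are hypothetical, and it assumes each backend returns scores that are already comparable (normalized to 0..1), which real engines will not do for you.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hit:
    title: str
    url: str
    score: float  # assumed to be normalized to 0..1 by each backend


class SearchBackend(ABC):
    """The only thing the UI layer is allowed to know about a search engine."""

    @abstractmethod
    def search(self, query: str, limit: int) -> List[Hit]:
        ...


class StaticBackend(SearchBackend):
    """Stand-in for a real adapter (crawler index, web service, Solr, ...)."""

    def __init__(self, hits: List[Hit]):
        self._hits = hits

    def search(self, query: str, limit: int) -> List[Hit]:
        q = query.lower()
        return [h for h in self._hits if q in h.title.lower()][:limit]


class UnifiedSearch:
    """Queries every backend and merges the results into one ranked list."""

    def __init__(self, backends: List[SearchBackend]):
        self._backends = backends

    def search(self, query: str, limit: int = 10) -> List[Hit]:
        best_by_url: Dict[str, Hit] = {}
        for backend in self._backends:
            for hit in backend.search(query, limit):
                current = best_by_url.get(hit.url)
                # Keep only the highest-scoring hit for each URL.
                if current is None or hit.score > current.score:
                    best_by_url[hit.url] = hit
        ranked = sorted(best_by_url.values(), key=lambda h: h.score, reverse=True)
        return ranked[:limit]


crawl = StaticBackend([Hit("Solr guide", "/a", 0.9), Hit("About us", "/b", 0.5)])
service = StaticBackend([Hit("Solr FAQ", "/c", 0.7), Hit("Solr guide", "/a", 0.4)])
results = UnifiedSearch([crawl, service]).search("solr")
```

Swapping the search engine then means writing one new `SearchBackend` subclass; the UI and `UnifiedSearch` stay untouched.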

Take a look at these similar questions:
Best text search engine for integrating with custom web app?
How do I implement Search Functionality in a website?
My personal recommendation: Solr.

All these companies offer different features of Universal Search. Smaller companies have carved out very functional and highly sought-after niches for themselves. For example, Queplix enables any search engine to work with structured data and enterprise applications by extracting the data, business objects, roles and permissions from all indexed applications. It provides enterprise-ranking criteria as well as data-compliance alerts.

Two other solutions that weren't as well known and/or available around the time the original question was asked:
Google Custom Search - especially since the disable public URL option was recently added
YaCy - you can join the network or download and roll your own independent servers

Related

User orientated search?

I am currently looking for a search tool that can handle a large number of documents, a few different websites, and an LMS.
Where this differs is that we would like relevancy to depend heavily on user roles. Everyone is automatically logged into all of our systems via SSO, much like on this site. We want heavy weighting to be put on the documents, website articles/knowledge base entries, and classes in the LMS that are meant for the user's selected role.
I personally have limited knowledge of Solr, which we use for some full-text searches. I have considered looking into Elasticsearch, Solr, Google Appliance, and FAST.
Do any of these have innate features that will help me reach my end goal faster? My worry about Elasticsearch and Solr is the amount of development time. Our group has done limited search customization, so I am also wondering about the dev time needed for the various solutions.
You must not have done much research yet, because FAST is gone (absorbed by Microsoft) and Google Appliance is not very customizable.
Solr is more customizable than Elasticsearch, so that's my recommendation. For the rest, I would start by adding a field for the role and then using that as a boost factor.
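A minimal sketch of the "role field as a boost factor" idea, in Python, assuming the index has a field named `role` (a made-up name) and that queries go through Solr's edismax parser, whose `bq` parameter adds a boost query on top of the user's query. The boost value of 5.0 is an arbitrary starting point to tune.

```python
from urllib.parse import urlencode


def role_boosted_params(user_query: str, role: str, boost: float = 5.0) -> dict:
    """Build Solr request parameters that favor documents tagged with the user's role."""
    # Escape backslashes and quotes so the role can sit inside a quoted phrase.
    safe_role = role.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "defType": "edismax",                # parser that supports the bq parameter
        "q": user_query,
        "bq": f'role:"{safe_role}"^{boost}', # "role" is a hypothetical field name
    }


params = role_boosted_params("expense policy", "manager")
query_string = urlencode(params)  # ready to append to a /select URL
```

Documents for other roles still match; they just rank lower, which is usually what you want for "heavy weighting" as opposed to filtering.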
If the basic approach does not work, the Solr-Users mailing list may actually be a better place to follow up, as it allows a discussion to fine-tune the issue.
(Updated)
For more packaged solutions that integrate with Solr and include crawling, you can look at:
Apache Nutch to add crawling specifically
Apache ManifoldCF - has a lot of integrators into other data sources
LucidWorks Fusion (commercial)
Flexile search platform (commercial)
Cloudera - a Big Data solution, but it integrates with Solr and may have a crawler. It does have a nice UI, Hue
And probably more, if you dig a little bit.

What tools are out there for an Intranet search engine across a diverse toolset?

Basic requirements:
Should be able to index things like MediaWiki, Confluence, Sharepoint, GitHub:Enterprise, Askbot
Should be reasonably smart about de-duping results (one reason Confluence search is so painful).
Should definitely incorporate heuristics like how many pages link to a document, whether the search terms are in the title of the document, etc. If there's a way for users to downrank particular results, that might be a bonus.
Should be somewhat tunable (e.g., prefer Confluence over Sharepoint, blacklist certain paths).
Are there off-the-shelf products that can do the above? FOSS projects? Are there FOSS projects that can provide the basics for the above and are easy to extend or build a frontend for?
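Whatever product ends up doing the indexing, the ranking and tuning requirements listed above boil down to a scoring function along these lines. This is only a sketch in Python: the `Doc` fields, the weights, the title multiplier and the blacklist prefix are all invented for illustration, and de-duplication is not covered.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    url: str
    title: str
    source: str         # e.g. "confluence", "sharepoint"
    inbound_links: int  # how many pages link to this document
    base_score: float   # text relevance score from the underlying engine


# Hypothetical tuning knobs.
SOURCE_WEIGHT = {"confluence": 1.5, "sharepoint": 1.0}
BLACKLIST_PREFIXES = ("/archive/",)
TITLE_MATCH_MULTIPLIER = 2.0


def rerank(docs: List[Doc], query: str) -> List[Doc]:
    """Drop blacklisted paths, then order by a heuristic score."""
    terms = query.lower().split()
    scored = []
    for doc in docs:
        if doc.url.startswith(BLACKLIST_PREFIXES):
            continue
        score = doc.base_score * SOURCE_WEIGHT.get(doc.source, 1.0)
        if any(term in doc.title.lower() for term in terms):
            score *= TITLE_MATCH_MULTIPLIER
        # Dampen the link count so heavily linked pages do not swamp everything.
        score *= 1.0 + math.log1p(doc.inbound_links)
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


docs = [
    Doc("/wiki/a", "Deploy guide", "sharepoint", 0, 1.0),
    Doc("/wiki/b", "Deploy guide", "confluence", 0, 1.0),
    Doc("/archive/c", "Deploy guide", "confluence", 100, 1.0),
]
ranked = rerank(docs, "deploy")
```

User downranking could be folded in as one more multiplier per document.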
You can try Apache Solr, it's a great tool.
According to the website:
Solr is the popular, blazing fast open source enterprise search
platform from the Apache Lucene project. Its major features include
powerful full-text search, hit highlighting, faceted search, near
real-time indexing, dynamic clustering, database integration, rich
document (e.g., Word, PDF) handling, and geospatial search. Solr is
highly reliable, scalable and fault tolerant, providing distributed
indexing, replication and load-balanced querying, automated failover
and recovery, centralized configuration and more. Solr powers the
search and navigation features of many of the world's largest internet
sites.
You could try a bundled version of Solr and other tools, such as OpenESP or Constellio. Expect to spend some time tuning the sources and imports. ManifoldCF, which is bundled with OpenESP, is an open source connector/crawler framework for plugging in connectors to various systems like those you describe, and several connectors come out of the box.
You can try Moogle. It is open source and easy to deploy on Windows with IIS. Its look is modeled on Google, so it should feel a bit familiar. Try http://techstuff.smsjuju.com/intranet-search-engine/

What are the pros & cons of using Google CSE vs implementing dedicated search engine for a site like stackoverflow?

I understand that work should not be repeated when Google CSE is already there, so what might be the reasons to consider implementing a dedicated search engine for a public-facing website similar to SO (and why did StackOverflow probably do that)? The paid version of CSE (Google Site Search) already eliminates several drawbacks that used to force a dedicated implementation. Cost may be one reason not to choose Google CSE, but what are the other reasons?
Another thing I want to ask: my site is of a similar kind to StackOverflow, so when Google indexes its content every now and then, won't that overload my database servers with lots of queries, possibly at peak traffic time?
I am looking forward to using the Google Custom Search API, but I need to clarify whether the 1000 paid queries that I get for $5 are valid for only one day, or whether unused ones carry over to cover extra queries (beyond the free ones) on the following days. Can anyone clarify this too?
This depends on the content of your site, the frequency of the updates, and the kind of search you want to provide.
For example, with StackOverflow, there'd probably be no way to search for questions of an individual user through Google, but it can be done with an internal search engine easily.
Similarly, Google can retire their API at any time. In fact, Google has already done so with their Google Web Search API, where a lot of non-profits that had projects based on that API were left on the street with no Google options for continuing their services (paying 100 USD/year for only 20,000 search queries per year may be fine for a posh blog, but it greatly limits what you can actually use the search API for).
On the other hand, you probably already want to have Google index all of your pages, to get the organic search traffic, so Google CSE would probably use rather minimal resources of your server, compared to having a complete in-house search engine.
Now that Google Site Search is gone, the best alternative search tool for all the loyal Google fans is Google Custom Search (CSE).
Some of the features of Google Custom Search that I loved the most were:
It's free (with ads)
Ability to monetise those ads with your AdSense account
Tons of customization options, including removing the Google branding
Ability to link it with a Google Analytics account for highly comprehensive analytical reports
Powerful autocorrect feature that understands the real intention behind typos
Cons: lacks customer support…
Read More: https://www.techrbun.com/2019/05/google-custom-search-features.html

Software alternatives to Google Search Appliance (GSA)

I am interested in software alternatives to the Google Search Appliance (GSA) for use in a (large) university context. Has anyone experiences of migrating from GSA to an alternative solution? If so, what were the reasons for doing this (technical, financial, staff effort, etc) and have the experiences been positive?
I would recommend looking at Apache Solr; it is IMHO the best scalable, feature-rich search server out there. It is a F/OSS, out-of-the-box solution from the Apache Software Foundation, used by organizations such as Netflix, AOL, CNet, etc. We used GSA in our company for a year before moving to Solr. The move was relatively painless compared to the benefits accrued.
Since it exposes a RESTful interface, it can be integrated into your platform of choice without language/platform tie-ins. Give it a whirl!
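To show what "no language tie-in" means in practice, here is a minimal Python sketch that builds a Solr search request as a plain URL. The host, port and core name (`university`) are placeholders for illustration; any language with an HTTP client can do the same.

```python
from urllib.parse import urlencode


def solr_select_url(base_url: str, core: str, query: str, rows: int = 10) -> str:
    """Build the URL for a search against a Solr core's select handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/solr/{core}/select?{params}"


# Placeholder host and core name; substitute your own installation's values.
url = solr_select_url("http://localhost:8983", "university", "admissions deadline")
print(url)
```

Fetching that URL with any HTTP client returns the results as JSON, with the matching documents typically found under `response.docs`.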
We are currently moving from Google (GSA) to Microsoft FAST (specifically FSIS).
The reason is simple: we are not satisfied with the Google experience from a supportability and manageability perspective. We have chosen FAST because it gives us a platform that can scale as our needs grow over the next few years. It also gives us a very fine level of control: the ability to define custom fields and then control how these fields are populated.
The company I work for is a Google GSA partner and has developed a solution on top of the GSA. We also have a cloud solution with very similar benefits to the GSA and a host of things that the GSA can't do - like scale geographically, scale with load, upload data and have it in the index in near real-time, have nested records, deal with hierarchy etc...
In our experience, the people who migrated from the GSA to the Cloud solution did so for the following reasons.
Primarily, they did not want to manage hardware.
Most of our customers are ecommerce / media companies, and they had a lot of navigation. The GSA search throughput really struggles when you have a lot of navigations / refinements. For example if you have 20 navigations, the throughput drops from around 50 queries per second to about 12.
Indexing time - the GSA has a minimum of 7 minutes for something to show up in the index, and for ecomm / media these times are unacceptable.
GroupBy has written migration tools to allow the smooth transition from GSA --> Cloud and also the cloud platform accepts the same format that the GSA accepts.
Have the experiences been positive? Well, clearly I'm going to be biased and say yes, but there are hard conversion increases that support the clients' positivity. :-)
More details at: www.groupbyinc.com

Embed Google/Yahoo search into a web site or build your own

I am looking for an opinion on whether to use Google Custom Search, Yahoo Search Builder, or build my own for web projects (no more than 100 pages of content). If I should build my own, do you have any quick-start kits you could recommend?
Many thanks
Chris
I have had success using OpenSearch for my personal blog.
While working at BigCorp we used dedicated search appliances in yellow boxes, but in your case (around 100 pages) it does not make sense to take such a route.
I would suggest going with either Google Custom Search, or Yahoo Search Builder (as long as they both index your site sufficiently to provide good results).
More often than not, you'll get better results and you don't have to worry about building your own custom engine (or implementing an off the shelf/open source piece of software to do the job for you).
I've used IBM OmniFind Yahoo Edition and had fantastic results with it. You are limited to a single index per implementation, but it's very fast, easy to integrate with, and extensible in terms of search customization. I've used it with an ASP.NET site without issue. One caveat is that it needs to be installed on the server and run as a service, so it is out of the question for most shared hosting. It has the indexing capabilities of general search engines (PDF/HTML/etc.), which is very nice.
Edit:
I forgot to mention that some of the reasons I liked it vs. other options are that it is free and doesn't require any additional hardware, just FYI.
The main situation in which I see Google/Yahoo as being sub-optimal is when your site relies on up-to-the-minute results. You're at the mercy of their crawling policies/speed/etc. If that's okay (and I suspect it will be for most 100ish page sites), use them - the results will be great. If realtime results are important, you may have to bite the bullet and install something locally.
Yahoo BOSS is cheaper and recommended by many people.
I am going to integrate it soon.

Resources