I want to implement search functionality for a website (assume it is similar to SO). I don't want to use Google search or anything like that.
My question is:
How do I implement this?
There are two methods I am aware of:
Search all the databases in the application when the user submits a query.
Index all the data I have and store it somewhere else and query from there (like what Google does).
Can anyone tell me which way to go? What are the pros and cons?
Better still, are there any better ways to do this?
Use Lucene:
http://lucene.apache.org/java/docs/
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
It is available in Java and .NET. It is also available in PHP in the form of a Zend Framework module.
Lucene does what you want (indexing of the searched items). You have to maintain a Lucene index, but it is much better than doing a database search in terms of performance. BTW, SO's search is powered by Lucene. :D
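To give a feel for how little code the indexing route takes, here is a minimal sketch using Lucene's core Java API; the index path, field names, and sample text are illustrative, not taken from the question:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class SiteSearch {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get("/tmp/site-index")); // illustrative path
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one document: a stored id plus an analyzed full-text body.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("id", "42", Field.Store.YES));
            doc.add(new TextField("body", "How do I implement site search?", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Query the index instead of hitting the database on every search.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query q = new QueryParser("body", analyzer).parse("implement search");
            for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("id"));
            }
        }
    }
}
```

Your application would re-index (or incrementally update) documents whenever they change; that is the index bookkeeping mentioned above.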
It depends on how comprehensive your web site is and how much you want to do yourself.
If you are running a small website with no way to add a custom search, let Google do the work (maybe add a sitemap) and use Google Custom Search.
If you run a medium-sized site with an SQL engine, use the search features of your SQL engine.
If you run a heavier software stack like J2EE or .NET, use Lucene, a great, powerful search engine, or its .NET port, Lucene.Net.
If you want to abstract your search from your application and be able to query it in a language-neutral way with XML/HTTP and JSON APIs, have a look at Solr. Solr runs Lucene in the background, but adds a nice web interface to it.
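For illustration, a minimal sketch of querying Solr's HTTP API from Java; the host, port, core name ("articles"), and query field are assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrQueryExample {
    public static void main(String[] args) throws Exception {
        // Core name "articles" and the host/port are assumptions for illustration.
        String url = "http://localhost:8983/solr/articles/select?q=body:search&wt=json";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON result set from Solr
    }
}
```

Because the interface is just HTTP plus XML/JSON, any language on your stack can talk to it the same way.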
You might want to have a look at Xapian and the Omega front end. It's essentially a toolkit on which you can build search functionality.
The best way to approach this will depend on how you construct your pages.
If they're frequently composed from a lot of different records (as I imagine stack overflow pages are), the indexing approach is likely to give better results unless you put a lot of work into effectively reconstructing the pages on the database side.
The disadvantage of the indexing approach is the turnaround time. There are workarounds (like Google's sitemap mechanism), but they're also complex to get right.
If you go down the database path, also be aware that modern search engines function much better if they have link data to process, so finding a system that can understand links between 'pages' in the database will have a positive effect.
If you are on a Microsoft platform, you could use the Indexing Service. This integrates very easily with IIS websites.
It has all the basic features like full-text search and ranking, lets you exclude and include certain file types, and you can add your own meta information as well via meta tags in the HTML pages.
Do a Google search and you'll find tons!
This is somewhat orthogonal to your question, but I highly recommend the idea of a RESTful search. That is, to perform a search that has never been performed, the website POSTs a query to /searches/. To re-run a search, the website GETs /searches/{some id}.
There are some good documents to be found regarding this, for example here.
(That said, I like indexing where possible, though it is an optimization, and thus can be premature.)
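A minimal sketch of those two endpoints using JAX-RS annotations; every class, field, and helper name here is hypothetical:

```java
import javax.ws.rs.*;
import javax.ws.rs.core.Response;
import java.net.URI;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

@Path("/searches")
public class SearchResource {
    // In-memory store of saved queries; a real application would persist these.
    private static final Map<String, String> SAVED = new ConcurrentHashMap<>();

    // POST a new query; the response points at the created search resource.
    @POST
    public Response create(String query) {
        String id = UUID.randomUUID().toString();
        SAVED.put(id, query);
        return Response.created(URI.create("/searches/" + id)).build();
    }

    // GET re-runs a previously posted search by id (results are recomputed).
    @GET
    @Path("/{id}")
    public String rerun(@PathParam("id") String id) {
        String query = SAVED.get(id);
        if (query == null) throw new NotFoundException();
        return runSearch(query); // hypothetical helper delegating to the index
    }

    private String runSearch(String query) {
        return "results for: " + query; // placeholder
    }
}
```

A nice side effect is that saved searches become shareable, bookmarkable URLs.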
If your application uses the Java EE stack and you are using Hibernate, you can use the Compass Framework to maintain a searchable index of your database. The Compass Framework uses Lucene under the hood.
The only catch is that you cannot replicate your search index. So you need to use a clustered database to hold the index tables or use the newer grid based index storage mechanisms that have been added to the Compass Framework 2.x.
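For reference, a minimal sketch of what a Compass-mapped Hibernate entity looks like, assuming Compass's annotation-based mapping; the class and field names are illustrative:

```java
import org.compass.annotations.Searchable;
import org.compass.annotations.SearchableId;
import org.compass.annotations.SearchableProperty;

// A Hibernate entity marked up for Compass; names are illustrative.
@Searchable
public class Article {
    @SearchableId
    private Long id;

    @SearchableProperty(name = "title")
    private String title;

    @SearchableProperty(name = "body")
    private String body;

    // getters/setters omitted for brevity
}
```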
I am currently looking for a search appliance that can handle a large number of documents, a few different websites, and an LMS.
Where this differs is that we would like a heavy amount of relevancy based on user roles. Everyone is auto-logged into all of our systems via SSO, much like this site. We want heavy weighting to be put on documents, website articles/knowledge-base entries, and classes in the LMS that are for the user's selected role.
I personally have limited knowledge of Solr, which we use for some full-text searches. I have considered looking into Elasticsearch, Solr, Google Search Appliance, and FAST.
Do any of these have any innate features that will help me get to my end goal faster? My worry about Elasticsearch and Solr is the amount of development time. Our group has done limited search customization, so I am also wondering about the dev time needed for the various solutions.
You must not have done much research yet, because FAST is gone (absorbed by Microsoft) and the Google Search Appliance is not very customizable.
Solr is more customizable than Elasticsearch, so that's my recommendation. For the rest, I would start by having a field for the role and then using that as a boost factor.
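As a sketch of the boost idea, assuming the eDisMax query parser and a field called role (both assumptions, not something given in the question):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RoleBoostQuery {
    public static String buildUrl(String userQuery, String role) {
        // eDisMax's bq (boost query) parameter lifts documents whose "role"
        // field matches the logged-in user's role; the field name and the
        // boost of 5.0 are assumptions you would tune.
        return "http://localhost:8983/solr/docs/select"
             + "?defType=edismax"
             + "&q=" + URLEncoder.encode(userQuery, StandardCharsets.UTF_8)
             + "&bq=" + URLEncoder.encode("role:" + role + "^5.0", StandardCharsets.UTF_8);
    }
}
```

The role would come from your SSO session, so every user gets role-weighted results without changing the query they type.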
If the basic approach does not work, the Solr-Users mailing list may actually be a better place to follow up, as it allows a discussion to fine-tune the issue.
(Updated)
For more packaged solutions that integrate with Solr and include crawling, you can look at:
Apache Nutch to add crawling specifically
Apache ManifoldCF - has a lot of integrators into other data sources
LucidWorks Fusion (commercial)
Flexile search platform (commercial)
Cloudera - is a Big Data solution, but it integrates with Solr and - may - have a crawler. It does have a nice UI, Hue.
And probably more, if you dig a little bit.
Basic requirements:
Should be able to index things like MediaWiki, Confluence, SharePoint, GitHub:Enterprise, Askbot
Should be reasonably smart about de-duping results (one reason Confluence search is so painful).
Should definitely incorporate heuristics like how many pages link to a document, whether the search terms are in the title of the document, etc. If there's a way for users to downrank particular results, that might be a bonus.
Should be somewhat tunable (e.g., prefer Confluence over SharePoint, blacklist certain paths).
Are there off-the-shelf products that can do the above? FOSS projects? Are there FOSS projects that can provide the basics for the above and are easy to extend or build a frontend for?
You can try Apache Solr; it's a great tool.
According to the website:
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.
You could try a bundled version of Solr and other tools, such as OpenESP or Constellio. Expect to spend some time tuning the sources and imports. ManifoldCF, which is bundled with OpenESP, is an open source connector/crawler framework for plugging in connectors to various systems like those you describe, and several connectors come out of the box.
You can try Moogle. It is open source and easily deployable on Windows with IIS. It looks just like Google, so it feels a bit familiar. Try http://techstuff.smsjuju.com/intranet-search-engine/
We're looking for a CMS that we can use as the basis for a new product we're rolling out.
As it's principally a content-based thing, we need to base everything on a CMS, but there are a few things we need:
As we're supporting tens to hundreds of users, we ideally need a multi-tenant CMS (single shared code base) that can support different designs per site
As we're selling in functionality, we need something that will let us deploy a new 'module' and switch it on/off on a per site basis
We prefer stuff that is open source (PHP or Rails, that sort of thing)
Before I consider building something, is there anything out there that's any good?
Now I am biased, but dotCMS 1.9 is a flexible open source solution (Java) that was designed to make running tens or hundreds of sites within a single instance easy. You can create site "templates" and reuse them as needed. Sites can share content, assets and templates, or share nothing, depending on how you set them up. Users can have access to manage one site or many sites; their views into the management tool are limited by their permissions (as you'd expect). Again, I am obviously biased as I work for the company, but this is exactly the problem that dotCMS 1.9 was designed to solve.
Plone sounds like it'd do what you want.
It's written in Python, on top of Zope, and supports multiple distinct sites (with distinct and/or shared users, groups, styling). Extra functionality is added through 'products'; there are a number of Free extensions and it's quite easy to write your own too.
We use Alfresco (http://www.alfresco.com/); it seems to fit your description. Different designs per site can be achieved with what they call "web scripts". It supports a deployment and branching infrastructure that you can leverage for your different clients.
As we're supporting tens - hundreds of users, we ideally need a multi-tenant CMS (single shared code base), that can support different designs per site
My first thought when I read that was WordPress MU (perhaps with BuddyPress if you need groups, etc.), but it might not be "CMS" enough for your needs... you don't elaborate on which features of a CMS you are looking for (media management, workflows, etc.), so it's a bit hard to recommend one.
DotNetNuke supports multi-tenant operation, and has a fairly active marketplace for add on modules, skins etc. It has pretty well defined module development interfaces as well.
Yanel is a Java/XML/XSLT-based CMS (Apache 2.0 license) designed for multi-tenancy; one can run arbitrarily many sites inside the same Yanel instance. See in particular the documentation on "realms".
I have an application that generates around 10000 printed pages per month. Each report (around 2000/month) is archived as PDF on a simple network file share.
I am searching for a Document Management System meeting the following requirements:
watch the archive folder and update the index either on regular basis or when changes are detected
provide an Intranet Webpage where users can search documents based on filenames, timespans and other relevant file attributes
fulltext search
can handle large/substantially growing archives
To be clear, I am searching for a pre-built solution here, commercial products are accepted.
Sounds like Microsoft Search Server 2008 Express would be a good candidate. Free and installs in a couple of minutes.
I can suggest Google Docs. AFAIK it can handle all your requirements.
This is a very vague question and I'm not quite sure how to respond.
It looks like you want a way to index all your files and ensure that the information is kept up to date in the database. What I can suggest is you look into some search servers like:
Sphinx
Solr
These both take some setup, but they handle all your requirements: they can easily be set up to watch a folder and keep your index up to date, they provide great full-text search, they can be accessed via an intranet webpage if you set up a page to search your database, and they are used for enormous operations, so large archives shouldn't be a problem.
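As an illustration of the Solr route, a folder watcher could push each new PDF to Solr's extraction handler (Solr Cell). Here is a minimal sketch using the SolrJ client; the core name, file path, and document id are assumptions:

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import java.io.File;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        // Core name "archive" is an assumption; Solr Cell must be enabled
        // for the /update/extract handler to exist.
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/archive").build();

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("/share/reports/monthly-report.pdf"), "application/pdf");
        req.setParam("literal.id", "monthly-report"); // store a stable document id
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        solr.request(req); // Solr extracts the PDF text and indexes it
        solr.close();
    }
}
```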
If you're looking for a pre-built solution, I'm not sure what to mention.
Plone could work pretty well for your needs. It has plugins for indexing PDF content, and you can customize the metadata. Also, it has a fantastic web interface with built-in search. The best part is that it's free and easy to use, and if your needs grow, you can pay for support.
My only recommendation (at first glance) is that you store your content on the file system and not in the Zope OO database. You should only store your metadata and index data in the database. This is a pretty common way of storing large amounts of content in the document management world.
Hope that helps!
Tom Purl
As Tom said, Plone does what you describe. Its built-in full-text search relies on the command-line program pdftotext (which must be in the path) for PDFs. There are several extensions you may be interested in:
Reflecto - watches a part of the filesystem and allows searching and displaying it inside Plone (see Reflecto on plone.org/products)
TextIndexNG 3 - indexing extensions written for a publishing house: http://www.zopyx.com/projects/TextIndexNG3/textindexng3-the-leading-fulltext-indexing/
collective.solr - uses the search engine Solr to drive the catalog (see collective.solr on plone.org/products)
(Sorry, the links are missing due to Stack Overflow's new-user policy.)
I am working on a website that currently has a number of disparate search functions, for example:
A crawl 'through the front door' of the website
A search that communicates with a web-service
etc...
What would be the best way to tie these together, and provide what appears to be a unified search function?
I found the following list on Wikipedia:
Free and open source enterprise search software
Lucene and Solr
Xapian
Vendors of proprietary enterprise search software
AskMeNow
Autonomy Corporation
Concept Searching Limited
Coveo
Dieselpoint, Inc.
dtSearch Corp.
Endeca Technologies Inc.
Exalead
Expert System S.p.A.
Funnelback
Google Search Appliance
IBM
ISYS Search Software
Microsoft (includes Microsoft Search Server, Fast Search & Transfer)
Open Text Corporation
Oracle Corporation
Queplix Universal Search Appliance
SAP
TeraText
Vivísimo
X1 Technologies, Inc.
ZyLAB Technologies
Thanks for any advice regarding this.
Solr is an unbelievably flexible solution for search. Just in the last year I coded two Solr-based websites and worked on a third existing one, and each worked in a very different way.
Solr simply eats XML requests to add something to an index, and XML requests to search for something inside an index. It doesn't do crawling or text extraction for you, but most of the time these are easy to do. There are many existing add-ons to the Solr/Lucene stack, so maybe something for you already exists.
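For a concrete feel, a minimal sketch of posting one of those XML add requests from Java; the core name ("site") and field names are illustrative:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrXmlAdd {
    public static void main(String[] args) throws Exception {
        // The classic XML update format: an <add> envelope with one <doc>.
        String xml = "<add><doc>"
                   + "<field name=\"id\">page-1</field>"
                   + "<field name=\"body\">text extracted from the page</field>"
                   + "</doc></add>";

        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:8983/solr/site/update?commit=true"))
            .header("Content-Type", "text/xml")
            .POST(HttpRequest.BodyPublishers.ofString(xml))
            .build();
        System.out.println(HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```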
I would avoid proprietary software unless you're sure Solr is insufficient. It's one of the nicest programs I've worked with, very flexible when you need it and at the same time you can start in minutes without reading long manuals.
Note that no matter what search solution you use, a search setup is "disparate" by nature.
You will still have an indexer, and a search UI, or the "framework".
You WILL corner yourself by marrying a specific search technology. You actually want to have the UI as separate from the search backend as possible. The backend may stop scaling, or there may be a better search engine out there tomorrow.
Switching search engines is very common, so never - ever - write your interface with a specific search engine in mind. Always abstract it, so the UI is not aware of the actual search technology used.
Keep it modular, and you will thank yourself later.
By using a standard web services interface, you can also allow 3rd parties to build stuff for you, and they won't have to "learn" whatever search engine you use on the backend.
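As an illustration of that abstraction, a minimal sketch in Java; all names here are illustrative:

```java
import java.util.List;

// The UI depends only on this interface, never on an engine's own classes.
interface SearchService {
    List<SearchHit> search(String query, int limit);
}

// A result type of our own, independent of any search engine's API.
class SearchHit {
    final String id;
    final String title;
    final double score;

    SearchHit(String id, String title, double score) {
        this.id = id;
        this.title = title;
        this.score = score;
    }
}

// One concrete backend; a SolrSearchService, XapianSearchService, etc.
// could be dropped in later without touching the UI.
class LuceneSearchService implements SearchService {
    @Override
    public List<SearchHit> search(String query, int limit) {
        // ... delegate to Lucene here ...
        return List.of();
    }
}
```

The UI codes against SearchService only; swapping Lucene for Solr (or anything else) means writing one new implementation class.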
Take a look at these similar questions:
Best text search engine for integrating with custom web app?
How do I implement Search Functionality in a website?
My personal recommendation: Solr.
All these companies offer different features of universal search. Smaller companies carved out very functional and highly desirable niches for themselves. For example, Queplix enables any search engine to work with structured data and enterprise applications by extracting the data, business objects, roles and permissions from all indexed applications. It provides enterprise ranking criteria as well as data-compliance alerts.
Two other solutions that weren't as well-known and/or available around the time the original question was asked:
Google Custom Search - especially since the disable public URL option was recently added
YaCy - you can join the network or download and roll your own independent servers