What skill set is needed to set up Solr or ElasticSearch? - search

Two clients of mine are evaluating setting up a search server, either Solr or ElasticSearch. We're wondering what programming languages (if any) and development environments are necessary to get the search servers running. Can it be done by people mostly familiar with front end technologies (HTML/CSS/JavaScript) or is more serious coding skill needed (e.g. understanding of multithreading/ advanced debugging/ other "pro-level" concepts)?
If only light programming skills are needed I'm playing with the thought of suggesting to set it up myself. I have very little Java knowledge but have basic understanding of C, ActionScript, Pascal and even Simula in addition to aforementioned front end technologies. I know basic search architecture from my time in FAST (an enterprise search vendor).
Best, Bjørn

Bit of a broad question but i'll try to give it a shot:
You don't need any programming language in particular. They're both stand alone servers which have API's which are addressable from any programming language.
ElasticSearch has a really nice API that's JSON/REST based.
SOLR's API is a lot more clunky, but also supports XML.
(If I have a choice I tend to go for ElasticSearch, unless there's a really specialized feature I need that's only in SOLR).
Getting up and running doesn't really require any knowledge of any programming language in particular.
The only time you NEED java is when you decide you end up needing custom plugins to SOLR/ElasticSearch itself.
You don't need any specific IDE's beyond those matching your programming language of choice.
When trying to figure out what's going on inside my elasitc search server I do like elastic search HEAD:
http://mobz.github.io/elasticsearch-head/
Hope this helps.

As pointed out already, this is quite a broad question, most likely get closed. But I'll give it a go too.
Both ElasticSearch and Solr are quite easy to get started with. They come as a zip/tar.gz archive that you can extract.
Both require JVM, so you need Java setup.
Once setup, playing with either is quite easy, you do not need any advanced programming skills to play around with it. Solr comes with an Admin UI page, that allows you to execute queries.
Elastic Search has clients as #Constantijin has pointed out. Elastic-head is an excellent choice.
You will need quite a detailed understanding of the Lucene ecosystem, its architecture, plugins etc. Given you have an understanding of another Search Engine, the concepts around indexing and text processing should be easy enough for you.
If you want to write something more advanced than the Admin UI, and you can use Javascript.
You can use AjaxSolr for making ajax requests to your Solr instance
For ElasticSearch, you can try using Elastic.js.

Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search-engine library. Lucene is arguably the most advanced, high-performance, and fully featured search engine library in existence today—both open source and proprietary.
However, Elasticsearch is much more than just Lucene and much more than “just” full-text search. It can also be described as follows:
A distributed real-time document store where every field is indexed and searchable
A distributed search engine with real-time analytics
Capable of scaling to hundreds of servers and petabytes of structured and
unstructured data
I would like add more details regarding how to used ElasticSearch in php language check out - http://www.multidots.com/what-is-elasticsearch

[How to integrate ElasticSearch with PHP?][1]
By using curl, you can use ElasticSearch with your favorite programming language. Here is the example of simple curl request with ElasticSearch.
- PHP Sample Script:
You can find PHP client api on github:
[https://github.com/elastic/elasticsearch-php][2]
Check out Best Article on Elasticsearch - http://www.multidots.com/what-is-elasticsearch

Related

Recommended Approaches for building/designing a search engine for my website

I would like to build a search engine for my website so I can quickly find relevant content. I've done quite a few google searches, discovered ElasticSearch and Solr (which both sit on top of Lucene), and whoosh (python-based).
But are all of these search engines just building an "inverted-index" on top of the data? What are some other algorithmic approaches for getting higher quality searches?
I was intrigued by this blog post using collaborative filtering on top of Solr, which returns related search queries:
http://www.opensourceconnections.com/2013/08/25/semantic-search-with-solr-and-python-numpy/
Are there other common techniques that I should be aware of? Are there other libraries sitting on top of ElasticSearch/Solr that I could just plug into, and use "out-of-the-box"?
Any links or tips would be greatly appreciated!
You haven't mentioned what tech stack you are working on.
If you use Ruby on Rails, I would recommend Tire, which is a gem that gives a DSL wrapper over ElasticSearch. Essentially, it allows you to index your data in Elasticsearch.
For Rails, Sunspot is a very popular gem that people use to interface with Solr.
For .NET - SolrNET is a great Solr client.
Other part of your question (around implementing a good search engine) is too broad - I would recommend reading a good book such as Lucene in Action to get a feel of what Solr/Elasticsearch could do.
I do have a few notes that I wrote a while back, you can read about some of my experience in search here.
Edit:
Since you work on python, I would recommend Haystack, although it is specific to Django. It is very versatile for our needs. However, if you are not using django, I can think of solrpy as a Solr client. Haystack works with both Solr and Elasticsearch.
i suggest you to learn Solr API, cause it was developed since 4 5 years so you can find lots of plug-ins like related search API in Solr, But in elastic search it is very easy to configure however it is very young engine so needs to be developed more.
Pyes is a well-documented Python client for Elasticsearch.
Also, this Youtube video provides a good overview of using Elasticsearch with Python.
I suggest you to use Google Custom Search Engine.
Here have a look.
https://www.google.com/cse/all
We have developed several search engines both on Solr and Elastic. Solr used to be the best as it provided most of the tools needed to admin and debug your indexes. Right now Elastic offers the same features as Solr either natively or via plugins. Plus it is easier to configure in high performance/high availability scenarios (easy to shard or cluster).
Your technology stack is irrelevant. Both Solr and Elastic have clients nearly for every language, plus you can access both via plain HTTP:
That said, each search engine applies to a problem domain. Tunning Elastic or Solr to retrieve relevant results is a bit of an art with some trial and error.
You will have to define analyzers for each field you'll search on and according to your search patterns and the kind of results you will be expecting.
Eventually, to create search engines with a single input that search across disparate attributes of a document type, may need the use of DisMax queries where you can boost results depending on the matching of the search terms to specific document fields.
To summarize: go for Elastic, and get some plugins or frontends. Two suggestions:
Inquisitor: for testing your analyzers
Elastic Head: for administration purposes

Solr vs. ElasticSearch [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
What are the core architectural differences between these technologies?
Also, what use cases are generally more appropriate for each?
Update
Now that the question scope has been corrected, I might add something in this regard as well:
There are many comparisons between Apache Solr and ElasticSearch available, so I'll reference those I found most useful myself, i.e. covering the most important aspects:
Bob Yoplait already linked kimchy's answer to ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?, which summarizes the reasons why he went ahead and created ElasticSearch, which in his opinion provides a much superior distributed model and ease of use in comparison to Solr.
Ryan Sonnek's Realtime Search: Solr vs Elasticsearch provides an insightful analysis/comparison and explains why he switched from Solr to ElasticSeach, despite being a happy Solr user already - he summarizes this as follows:
Solr may be the weapon of choice when building standard search
applications, but Elasticsearch takes it to the next level with an
architecture for creating modern realtime search applications.
Percolation is an exciting and innovative feature that singlehandedly
blows Solr right out of the water. Elasticsearch is scalable, speedy
and a dream to integrate with. Adios Solr, it was nice knowing you. [emphasis mine]
The Wikipedia article on ElasticSearch quotes a comparison from the reputed German iX magazine, listing advantages and disadvantages, which pretty much summarize what has been said above already:
Advantages:
ElasticSearch is distributed. No separate project required. Replicas are near real-time too, which is called "Push replication".
ElasticSearch fully supports the near real-time search of Apache
Lucene.
Handling multitenancy is not a special configuration, where
with Solr a more advanced setup is necessary.
ElasticSearch introduces
the concept of the Gateway, which makes full backups easier.
Disadvantages:
Only one main developer [not applicable anymore according to the current elasticsearch GitHub organization, besides having a pretty active committer base in the first place]
No autowarming feature [not applicable anymore according to the new Index Warmup API]
Initial Answer
They are completely different technologies addressing completely different use cases, thus cannot be compared at all in any meaningful way:
Apache Solr - Apache Solr offers Lucene's capabilities in an easy to use, fast search server with additional features like faceting, scalability and much more
Amazon ElastiCache - Amazon ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory cache in the cloud.
Please note that Amazon ElastiCache is protocol-compliant with Memcached, a widely adopted memory object caching system, so code, applications, and popular tools that you use today with existing Memcached environments will work seamlessly with the service (see Memcached for details).
[emphasis mine]
Maybe this has been confused with the following two related technologies one way or another:
ElasticSearch - It is an Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Apache Lucene.
Amazon CloudSearch - Amazon CloudSearch is a fully-managed search service in the cloud that allows customers to easily integrate fast and highly scalable search functionality into their applications.
The Solr and ElasticSearch offerings sound strikingly similar at first sight, and both use the same backend search engine, namely Apache Lucene.
While Solr is older, quite versatile and mature and widely used accordingly, ElasticSearch has been developed specifically to address Solr shortcomings with scalability requirements in modern cloud environments, which are hard(er) to address with Solr.
As such it would probably be most useful to compare ElasticSearch with the recently introduced Amazon CloudSearch (see the introductory post Start Searching in One Hour for Less Than $100 / Month), because both claim to cover the same use cases in principle.
I see some of the above answers are now a bit out of date. From my perspective, and I work with both Solr(Cloud and non-Cloud) and ElasticSearch on a daily basis, here are some interesting differences:
Community: Solr has a bigger, more mature user, dev, and contributor community. ES has a smaller, but active community of users and a growing community of contributors
Maturity: Solr is more mature, but ES has grown rapidly and I consider it stable
Performance: hard to judge. I/we have not done direct performance benchmarks. A person at LinkedIn did compare Solr vs. ES vs. Sensei once, but the initial results should be ignored because they used non-expert setup for both Solr and ES.
Design: People love Solr. The Java API is somewhat verbose, but people like how it's put together. Solr code is unfortunately not always very pretty. Also, ES has sharding, real-time replication, document and routing built-in. While some of this exists in Solr, too, it feels a bit like an after-thought.
Support: there are companies providing tech and consulting support for both Solr and ElasticSearch. I think the only company that provides support for both is Sematext (disclosure: I'm Sematext founder)
Scalability: both can be scaled to very large clusters. ES is easier to scale than pre-Solr 4.0 version of Solr, but with Solr 4.0 that's no longer the case.
For more thorough coverage of Solr vs. ElasticSearch topic have a look at https://sematext.com/blog/solr-vs-elasticsearch-part-1-overview/ . This is the first post in the series of posts from Sematext doing direct and neutral Solr vs. ElasticSearch comparison. Disclosure: I work at Sematext.
I see that a lot of folks here have answered this ElasticSearch vs Solr question in terms of features and functionality but I don't see much discussion here (or elsewhere) regarding how they compare in terms of performance.
That is why I decided to conduct my own investigation. I took an already coded heterogenous data source micro-service that already used Solr for term search. I switched out Solr for ElasticSearch then I ran both versions on AWS with an already coded load test application and captured the performance metrics for subsequent analysis.
Here is what I found. ElasticSearch had 13% higher throughput when it came to indexing documents but Solr was ten times faster. When it came to querying for documents, Solr had five times more throughput and was five times faster than ElasticSearch.
Since the long history of Apache Solr, I think one strength of the Solr is its ecosystem. There are many Solr plugins for different types of data and purposes.
Search platform in the following layers from bottom to top:
Data
Purpose: Represent various data types and sources
Document building
Purpose: Build document information for indexing
Indexing and searching
Purpose: Build and query a document index
Logic enhancement
Purpose: Additional logic for processing search queries and results
Search platform service
Purpose: Add additional functionalities of search engine core to provide a service platform.
UI application
Purpose: End-user search interface or applications
Reference article : Enterprise search
I have been working on both solr and elastic search for .Net applications.
The major difference what i have faced is
Elastic search :
More code and less configuration, however there are api's to change
but still is a code change
for complex types, type within types i.e nested types(wasn't able to achieve in solr)
Solr :
less code and more configuration and hence less maintenance
for grouping results during querying(lots of work to achieve in
elastic search in short no straight way)
I have created a table of major differences between elasticsearch and Solr and splunk, you can use it as 2016 update:
While all of the above links have merit, and have benefited me greatly in the past, as a linguist "exposed" to various Lucene search engines for the last 15 years, I have to say that elastic-search development is very fast in Python. That being said, some of the code felt non-intuitive to me. So, I reached out to one component of the ELK stack, Kibana, from an open source perspective, and found that I could generate the somewhat cryptic code of elasticsearch very easily in Kibana. Also, I could pull Chrome Sense es queries into Kibana as well. If you use Kibana to evaluate es, it will further speed up your evaluation. What took hours to run on other platforms was up and running in JSON in Sense on top of elasticsearch (RESTful interface) in a few minutes at worst (largest data sets); in seconds at best. The documentation for elasticsearch, while 700+ pages, didn't answer questions I had that normally would be resolved in SOLR or other Lucene documentation, which obviously took more time to analyze. Also, you may want to take a look at Aggregates in elastic-search, which have taken Faceting to a new level.
Bigger picture: if you're doing data science, text analytics, or computational linguistics, elasticsearch has some ranking algorithms that seem to innovate well in the information retrieval area. If you're using any TF/IDF algorithms, Text Frequency/Inverse Document Frequency, elasticsearch extends this 1960's algorithm to a new level, even using BM25, Best Match 25, and other Relevancy Ranking algorithms. So, if you are scoring or ranking words, phrases or sentences, elasticsearch does this scoring on the fly, without the large overhead of other data analytics approaches that take hours--another elasticsearch time savings.
With es, combining some of the strengths of bucketing from aggregations with the real-time JSON data relevancy scoring and ranking, you could find a winning combination, depending on either your agile (stories) or architectural(use cases) approach.
Note: did see a similar discussion on aggregations above, but not on aggregations and relevancy scoring--my apology for any overlap.
Disclosure: I don't work for elastic and won't be able to benefit in the near future from their excellent work due to a different architecural path, unless I do some charity work with elasticsearch, which wouldn't be a bad idea
If you are already using SOLR, remain stick to it. If you are starting up, go for Elastic search.
Maximum major issues have been fixed in SOLR and it is quite mature.
Imagine the use case:
A lot(100+) of small(10Mb-100Mb, 1000-100000 documents) search indexes.
They are using by a lot of applications (microservices)
Each application can use more than one index
Small by size index, yes. But huge load(hundreds search-requests per second) and requests are complex (multiple aggregations, conditions and so on)
Downtimes are not allowed
All of that is working years long, and constantly growing.
Idea to have individual ES instance per each index - is huge overhead in this case.
Based on my experience, this kind of use case is very complex to support with Elasticsearch.
Why?
FIRST.
The major problem is fundamental back compatibility disregard.
Breaking changes are so cool!
(Note: imagine SQL-server which require you to do small change in all your SQL-statements, when upgraded... can't imagine it. But for ES it's normal)
Deprecations which will dropped in next major release are so sexy!
(Note: you know, Java contain some deprecations, which 20+ years old, but still working in actual Java version...)
And not only that, sometimes you even have something which nowhere documented (personally came across only once but... )
So. If you want to upgrade ES (because you need new features for some app or you want to get bug fixes) - you are in hell. Especially if it is about major version upgrade.
Client API will not back compatible. Index settings will not back compatible.
And upgrade all app/services same moment with ES upgrade is not realistic.
But you must do it time to time. No other way.
Existing indexes is automatically upgraded? - Yes. But it not help you when you will need to change some old-index settings.
To live with that, you need constantly invest a lot of power in ... forward compatibility of you apps/services with future releases of ES.
Or you need to build(and anyway constantly support) some kind of middleware between you app/services and ES, which provide you back compatible client API.
(And, you can't use Transport Client (because it required jar upgrade for every minor version ES upgrade), and this fact do not make your life easier)
Is it looks simple & cheap? No, it's not. Far from it.
Continuous maintenance of complex infrastructure which based on ES, is way to expensive in all possible senses.
SECOND.
Simple API ? Well... no really.
When you is really using complex conditions and aggregations.... JSON-request with 5 nested levels is whatever, but not simple.
Unfortunately, I have no experience with SOLR, can't say anything about it.
But Sphinxsearch is much better it this scenario, becasue of totally back compatible SphinxQL.
Note:
Sphinxsearch/Manticore are indeed interesting. It's not Lucine based, and as result seriously different. Contain several unique features from the box which ES do not have and crazy fast with small/middle size indexes.
I have use Elasticsearch for 3 years and Solr for about a month, I feel elasticsearch cluster is quite easy to install as compared to Solr installation. Elasticsearch has a pool of help documents with great explanation. One of the use case I was stuck up with Histogram Aggregation which was available in ES however not found in Solr.
Add an nested document in solr very complex and nested data search also very complex. but Elastic Search easy to add nested document and search
I only use Elastic-search. Since I found solr is very hard to start.
Elastic-search's features:
Easy to start, very few setting. Even a newbie can setup a cluster step by step.
Simple Restful API which using NoSQL query. And many language libraries for easy accessing.
Good document, you can read the book: . There is a web version on official website.

Solr / Lucene / Search Hosting

I need some sort of hosted search API for my website where I can submit content and search content with fuzzy logic, where spelling mistakes and grammar won't affect results.
I want to use solr/lucene or whatever technology is out there, without needing to install stuff on my server to reduce setup complexity.
What solr/lucene/othersearch hosting services are there?
I'm read some other posts on stackoverflow, but they are either no longer in business or are wordpress extensions that require server installation (i.e. the processing is done on the server).
You might consider Websolr, of which I am a cofounder, which is exactly the sort of service that you describe.
The thing is, Solr is highly dependant on its datamodel. Or rather how your users search will really affect the way you structure the data model in Solr. As far as I know there aren’t any really good hosting services for Solr yet because you almost always need to do such extensive modifications to the Solr configuration (most notably the schema.xml).
However, with that said, Solr is really easy to get up and running. The example application is bundled with Jetty and runs more or less directly after download.
So unless you have immense scaling issues (read 5-10+ milj documents or a really high query per second load) I’d recommend you to actually install the application on your own server.
Amazon CloudSearch is the best alternate if you do not want to worry about hosting.
http://aws.amazon.com/cloudsearch/
http://docs.amazonwebservices.com/cloudsearch/latest/developerguide/SvcIntro.html
gotosolr - http://gotosolr.com/en
Apache Solr indexes are distributed on 2 hosting companies.
Security is managed by Https and basic http authentication.
Real-time statistics.
Also ready for agencies with multi-accounts and
multi-subscriptions.
Supports Drupal and WPSOLR (https://wordpress.org/plugins/wpsolr-search-engine/)

Did Atlassion build JIRA Query Language (JQL) from scratch?

My company is looking at advanced search and reporting solutions, and are considering (among other options) creating something akin to JIRA's JQL for maximum flexibility.
My googling leads me to believe Atlassian built JQL from scratch, at least as a language with syntax and a parser, but I thought I'd try SO before concluding. Anyone know, at a high level, how they did it? Was there one or more Open Source project they based it on?
(Kudos to Atlassian either way - JQL is gorgeous!)
I think they did it from scratch. The underlying architecture is crisp but quite complex. It took me a good few hours to get it, just reading the source and minimal user docs.
~Matt
Atlassian built JQL on top of Apache Lucene. You might want to take a look at Elasticsearch or Solr, which are open source alternatives, also built on Lucene.
I have been using Jira for a year and I notice "Apache Lucene" on the the directory, and before this I got a job wherein I was force to learn apache solr. So in conclusion, Jira is using Apache Lucene as a searching library which is also used was being used in Solr.
for more info read this:
http://www.lucenetutorial.com/lucene-vs-solr.html

Best text search engine for integrating with custom web app?

We have a web app that allows users to upload documents, create their own documents, and so on. Uploaded files are stored on Amazon S3, created information is stored in a MySQL database. What I'm looking for is some sort of search engine, where I feed it all of our text documents, each with a unique ID, and it builds an index or whatever. Later, I can give it search queries, and it will pull out the best matching documents (via their ID), along with snippets of matching text.
Basically we want to allow our users to search through their repository of uploaded stuffs, along with anything that other users have marked as public. The solution should run on a standard Linux server, and ideally it would be open source, but I'll also consider paid solutions if they aren't outrageously priced.
So far, I've found three potential candidates:
MySQL Full Text Search - some reports I've read are that it's very slow
Apache Lucene - unfortunately written in Java, but I'll use it if I have to. Supposedly fast
Sphinx - doesn't seem to be as popular, ideally whatever solution I find will have lots of community support.
Please let me know if there are any other good choices that I've overlooked, or if you have experience with any of the above.
Take a look at Solr. It's based on Lucene, so it's very fast, and it's really easy to use from any platform.
Sphinx may be worth your consideration, as it works well with several common RDMS (notably MySQL)
There is also Xapian which is fast and is quite customizable.
It has support for custom indexers allowing one to index data that is not stored in a database which might be useful for your documents stored on S3.
I imagine that Google will have a solution that meets your needs. Start here: Google Enterprise
There is a Ruby port of Lucene called "Ferret". In addition to the Ruby API, you can get at the underlying c implementation called "cFerret".
Lucene is very good. And although it was originally written in java there is a php implementation http://framework.zend.com/manual/en/zend.search.lucene.html

Resources