Beginner's guide to ElasticSearch [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
There haven't been any books about ElasticSearch (that I know of), and http://www.elasticsearch.org/guide/ seems to contain only reference material.
Any good beginner's guide or tutorials, perhaps by examples, to recommend, especially in terms of the different mapping and indexing strategies?

Edit (April 2015):
As many have noticed, my old blog is now defunct. Most of my articles were transferred over to the Elastic blog, and can be found by filtering on my name: https://www.elastic.co/blog/author/zachary-tong
To be perfectly honest, the best source of beginner knowledge is now Elasticsearch - The Definitive Guide written by myself and Clinton Gormley.
It assumes zero search engine knowledge and explains information retrieval first principles in the context of Elasticsearch. While the reference docs are all about finding the precise parameter you need, the Guide is a narrative that discusses problems in search and how to solve them.
Best of all, the book is OSS and free (unless you want to buy a paper copy, in which case O'Reilly will happily sell you one :) )
Edit (August 2013):
Many of my articles have been migrated over to the official Elasticsearch blog, as well as new articles that have not been published on my personal site.
Original post:
I've also been frustrated with learning ElasticSearch, having no Lucene/Solr experience. I've been slowly documenting things I've learned at my blog, and have four tutorials written so far:
So I don't have to keep editing, all future tutorials on my blog can be found under this category link.
And these are some links that I have bookmarked, because they have been incredibly helpful in one way or another:
Thinking through and debugging problems with your query
Another example of complicated mapping (ngram, synonyms, phonemes)
Searching parts of a word
Fun with ElasticSearch's children and nested documents

You can get an overview from this link:
http://spinscale.github.com/elasticsearch/2012-03-jugm.html#/1

I found Elastic Search one of the hardest things I've had to learn. I hadn't used Lucene before, and I found the documentation quite hard to follow.
These are the things that I wish I'd known before I started learning it:
Configuration and setup
I configured ELS to run on 3 VMs using CentOS, Mint and Ubuntu. CentOS was by far the best choice of the three.
I followed this guide to help me set it up (it worked fine on all three distros)
Index and types
One index can contain many types. It's by using types that you can achieve a good degree of separation between data that belongs within the same index.
PHP
I use PHP as a front end and used this wrapper to integrate my ELS installation into my scripts.
Other resources
The presentation in the other answer to your question is really good. Go through it and learn the query DSL syntax; once you are set up, this is where the real power of ELS comes into its own.

If you're new to Elasticsearch and to “information retrieval” / “full-text search” in general, my advice would be to check these resources first, before trying out tutorials on specific features:
The Your Data, Your Search, ElasticSearch presentation from EURUKO 2011
The ElasticSearch - A Distributed Search Engine talk by Shay Banon together with accompanying scripts
The Lucene in Action book (at least the general chapters on the indexing, analysis, tokenization, and constructing queries)

Related

Is there a need to code a new 3D engine? [closed]

Closed 10 years ago.
Coding a new 3D engine is fascinating, but there are so many out there. Is it sane for a programmer to start a new one? Are there industry sectors that need one?

Reasons to do it:
You want to learn about how to make a 3D engine, and don't really care if anyone but yourself uses it.
None of the existing engines do what you want and it's too much trouble to modify their source code (if you can even get it).
You have such an awesome idea and no other engine has done it so you need to do it because whatever you're doing doesn't exist yet.
Reasons not to do it:
You don't have enough of these resources: time/budget/expertise.
An existing engine fits your needs perfectly.

There are lots of reasons to build a new 3D engine (in no particular order):
The old one is a first person shooter, but you want a flight simulator.
The old one works, but isn't easy to use or has too many bugs
Someone else owns the old one
New hardware feature XYZ is fundamentally incompatible with the old engine
Someone is paying you to build one
You've never built one before.
Your game (simulation) only needs χ, but the old engine provides χ, ψ, ζ, α, β, γ, δ, and even π.
I happen to be building an OpenGL-based 3D engine in my off time right now. By implementing it myself, I'm expanding my basic knowledge of OpenGL way more than I would have by programming to someone else's interface (way more than I did when I implemented my own software affine texture mapped engine years ago). The downside is that I may never finish it :)

Generally, you code one if you have a need for one and there doesn't already exist one that suits your need.
Is there someone out there who needs an engine built for them because there doesn't exist one that suits their needs? Probably.

This is highly similar to the question "should I write my own program/technology/framework X instead of using an existing one?" and that has been asked plenty, so I won't go over the usual boilerplate reasons.
While the answer to this question will always be somewhat subjective, a great deal depends on the context in which it is asked.
If it's being asked along the lines of "I want to learn about game engines and rendering", then it can always be beneficial to write your own game engine, as developing the code is arguably the best way to learn. However, there may already exist good, well-documented open source engines to learn from as well.
If it's a commercial endeavor, then it's more a question of whether an existing engine provides what is needed. Modern commercial engines are written by some truly brilliant people and contain all the latest bells and whistles, so it's more than likely they would suffice. This is evident from the sheer number of games that have been developed on two of the most popular game engines: idTech and Unreal Engine. However, there may still be non-technical factors that rule out an existing engine and make writing your own the better option, such as whether the engine can be licensed adequately and whether the license can be afforded.

Personal tool for authoring & organizing user stories? [closed]

What about a simple spreadsheet (like this one)? A spreadsheet is extremely powerful (to re/organize, filter, etc) and has always worked well for me (use indentation if required, or an additional column for IDs of related stories).

We use a combination of spreadsheets and an internal Wiki for user stories. The spreadsheet holds the basic information, like ID, title, user role, priority and so on as well as a link to the Wiki page for this story.
The Wiki page then has all the information about the user story, a full description, acceptance criteria, design notes and so on.
If there are dependencies between stories these are included as links within the user story usually with a short note about what this dependency means (e.g. "This story assumes that story x has been completed" or "Y & Z are not part of this story, but of story X").
This is a pretty low-tech solution, and doesn't really support visual diagrams of relations. However, it has worked for us so far.

Perhaps this is overkill, but does Mingle by Thoughtworks do what you need?
(I am not actually a Mingle user, but this sounds like the sort of thing it would do.)

Effective strategies for studying frameworks/ libraries partially [closed]

Closed 10 years ago.
I remember the old, effective approach to studying a new framework: the best way was always to read a good book on the subject, say MFC. When I tried to skip a lot of material to speed up coding, it turned out later that it would have been quicker to read the whole book first. There were no good ways to study a framework in small parts, or at least I did not see them then.
In the last few years a lot has changed: improved search results from Google, programming blogs, many more people involved in Internet discussions, and a lot of open source frameworks.
Right now, when we write software, we much more often depend on third-party (usually open source) frameworks/libraries. And a lot of the time we need to know only a small part of their functionality to use them. It's just about finding the simplest way of using a small subset of the library without unnecessary pessimizations.
What do you do to study as little of the framework as possible and still use it effectively?
For example, suppose you need to index a set of documents with Lucene. And you need to highlight search snippets. You don't care about stemmers, storing the index in one file vs. multiple files, fuzzy queries and a lot of other stuff that is going to occupy your brain if you study Lucene in depth.
So what are your strategies, approaches, tricks to save your time?
I will enumerate what I would do, though I feel that my process can be improved.
Search "lucene tutorial", "lucene highlight example" and so on. Try to estimate the trust score of unofficial articles (blog posts) based on publishing date and the number and tone of the comments. If there is no definite answer, collect new search keywords and links on the target.
Search for really quick tutorials/newbie guides on the official site.
Estimate how valuable the javadocs are for a newbie. (Read the Lucene highlight package summary.)
Search for simple examples that come with the library, related to what you need. (Study "src/demo/org/apache/lucene/demo".)
Ask about a "simple Lucene search highlighting example" on the Lucene mailing list. You may get no answer or even get a bad reputation if you ask a silly question. And often you don't know whether your question is silly because you have not studied the framework in depth.
Ask on Stack Overflow or another Q&A service: "could you give me a working example of search keywords highlighting in Lucene". However, this question is very specific and may get no answers or a bad score.
Estimate how easy it is to get the answer from the framework code if it's open source.
What are your study/ search routes? Write them in priority order if possible.

I use a three phase technique for evaluating APIs.
1) Discovery - In this phase I search StackOverflow, CodeProject, Google and newsgroups with as many different combinations of search phrases as possible and add everything that might fit my needs to a huge list.
2) Filter/Sort - For each item I found in my gathering phase I try to find out if it suits my needs. To do this I jump right into the API documentation and make sure it has all of the features I need. The results of this go into a weighted list with the best solutions at the top and all of the cruft filtered out.
3) Prototype - I take the top few contenders and try to do a small implementation hitting all of the important features. Whatever fits the project best here wins. If for some reason an issue comes up with the best choice during implementation, it's possible to fall back on other implementations.
Of course, a huge number of factors go into choosing the best API for the project. Some important ones:
How much will this increase the size of my distribution?
How well does the API fit with the style of my existing code?
Does it have high quality/any documentation?
Is it used by a lot of people?
How active is the community?
How active is the development team?
How responsive is the development team to bug patch requests?
Will the development team accept my patches?
Can I extend it to fit my needs?
How expensive will it be to implement overall?
... And of course many more. It's all very project dependent.
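The weighted list from the Filter/Sort phase can be sketched roughly as below. This is only an illustration, not the answerer's actual process: the criteria names, the weights, the 0 to 5 rating scale and the three candidate libraries are all made up for the example.

```python
# Hypothetical criteria and weights (higher weight = matters more to this project).
WEIGHTS = {"documentation": 3, "community": 2, "fit_with_code": 2, "size": 1}


def weighted_score(ratings, weights=WEIGHTS):
    """Weighted average of 0-5 ratings; a criterion with no rating counts as 0."""
    total_weight = sum(weights.values())
    return sum(weights[name] * ratings.get(name, 0) for name in weights) / total_weight


def rank_candidates(candidates, weights=WEIGHTS):
    """Return candidate names ordered from best to worst weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


# Made-up ratings for three imaginary libraries.
candidates = {
    "lib_a": {"documentation": 5, "community": 4, "fit_with_code": 2, "size": 3},
    "lib_b": {"documentation": 3, "community": 5, "fit_with_code": 5, "size": 4},
    "lib_c": {"documentation": 1, "community": 1, "fit_with_code": 3, "size": 5},
}

for name in rank_candidates(candidates):
    print(name, round(weighted_score(candidates[name]), 3))
```

The top one or two names from a ranking like this are the ones that would go on to the Prototype phase; the numbers are only a way of making the trade-offs explicit, not a substitute for trying the library.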
As to saving time, I would say trying to save too much here will just come back to bite you later. The time put into selecting a good library is at least as important as the time spent implementing it. Also, think down the road: in six months, would you rather be happily coding or arguing with a xenophobic dev team? :) Spending a couple of extra days now doing a thorough evaluation of your choices can save a lot of pain later.

The answer to your question depends on where you fall on the continuum of generality/specificity. Do you want to solve an immediate problem? Are you looking to develop a deep understanding of the library? Chances are you’re somewhere between those extremes. Jeff Atwood has a post about how programmers move between these levels, based on their need.
When first getting started, read something on the high-level design of the framework or library (or language, or whatever technology it is), preferably by one of the designers. Try to determine what problems they are trying to address, what the organizing principles behind the design are, and what the central features are. This will form the conceptual framework from which future understanding will hang.
Now jump in to it. Create something. Do not copy and paste somebody's code. Instead, when things don’t work, read the error messages in detail, and the help on those error messages, and figure out why that error occurred. It can be frustrating, when things don’t work, but it forces you to think, and that’s when you learn.

1) Search Google for my task
2) Look at examples with a few different libraries; there is no need to tie myself down to Lucene, for example, if I don't know what other options I have.
3) Look at the date of the last update on the main page; if it hasn't been updated in 6 months, leave (with some exceptions)
4) Search for sample task with library (don't read tutorials yet)
5) Can I understand what's going on without a tutorial? If yes, continue; if no, start back at 1
6) Try to implement the task
7) Watch myself fail
8) Read a tutorial
9) Try to implement the task
10) Watch myself fail and ask on StackOverflow, or mail the authors, post on user group (if friendly looking)
11) If I could get the task done, I'll consider the framework worthy of study and read up the main tutorial for 2 hours (if it doesn't fit in 2 hours I just ignore what's left until I need it)

I have no recipe, in the sense of a set of steps I always follow, largely because everything I learn is different. Some things are radically new to me (Dojo, for example: I have no fluency in JavaScript, so that's a big task); some are just enhancements of previous knowledge (I know EJB 2 well, so learning EJB 3, while on the surface new with all its annotations, is building on familiar concepts).
My general strategy, though, I'd describe as "Spiral and Park". I try to circle the landscape first and understand its general shape. I park concepts that I don't get just yet and don't let them worry me. Then I go a little deeper into some areas, but again try not to get obsessed with any one, spiralling down into the subject. Hopefully I start to unpark and understand things, but I also need to park more things.
Initially I want answers to questions such as:
What's it for?
Why would I use this rather than that other thing I already know?
What's possible? Any interesting sweet spots? (E.g. "ooh, look at that nice AJAX-driven update")
I do a great deal of skim reading.
Then I want to do more exploring on the hows. I start to look for gotchas and good advice. (E.g. in Java: why is "wibble".equals(var) a useful construct?)
Specific techniques and information sources:
Most important: doing! As early as possible I want to work a tutorial or two. I probably have to get the first circuit of the spiral done, but then I want to touch and experiment.
Overview documents
Product documents
Forums and discussion groups, learning by answering questions is my favourite technique.
If at all possible I try to find gurus. I'm fortunate in having in my immediate colleagues a wealth of knowledge and experience.

Quick-start guides.
A quick look at the API documentation if available.
Reading sample code.
Messing around. YOU HAVE TO MESS AROUND (sorry for the caps).
If it's a small library/API with a small or no community you can always contact the developer himself and ask for help 'cause he'll probably be more than happy to help you; he's happy that one more person is using his API.

Mailing lists are a great resource as long as you do your homework first before asking questions.
Mailing list archives are invaluable for most of the questions I've had on CoreAudio related stuff.

I would never read javadoc, as there often is none. And when there is, most likely it isn't up to date, so at best one gets confused.
Start with the simplest possible tutorial you find within some minutes.
Often the tutorial will lead you to further sources at the end, so then most of the time one is on a path that goes on and on, deeper and deeper.

It really depends on what the topic is and how much info there is on it. Learning by example is a good way to start a topic that is brand new to you, especially if you're knowledgeable in other similar libraries or languages. You can take a topic you're familiar with, and say "I understand how to implement this using X, let's see how it's done using Y".

So what are your strategies, approaches, tricks to save your time?
Well, I search. I generally never ask questions, preferring to research myself. If worse comes to worst I'll read the documentation. In some cases (say, when I was doing some work with SharpSVN) I had to look at the source, specifically the test cases, to get some information about how the API worked.
Generally, I have to be honest, most of my 'study' and 'learning' is by accident.
For example, just a few seconds ago, I discovered how to get the "Recent" folder in C#. I had no idea how to do that before seeing the question, considering it interesting, and then searching.
So for me the real 'trick' is that I hang around on forums, answer questions, and accidentally pick up knowledge. Then, when it comes time for me to research something, chances are I know a bit about it, so searching is easier and I can focus on the implementation [typically implementing a test program first] and progress from there.

Building a web search engine [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 11 years ago.
I've always been interested in developing a web search engine. What's a good place to start? I've heard of Lucene, but I'm not a big Java guy. Any other good resources or open source projects?
I understand it's a huge under-taking, but that's part of the appeal. I'm not looking to create the next Google, just something I can use to search a sub-set of sites that I might be interested in.

There are several parts to a search engine. Broadly speaking, in a hopelessly general manner (folks, feel free to edit if you feel you can add better descriptions, links, etc):
The crawler. This is the part that goes through the web, grabs the pages, and stores information about them into some central data store. In addition to the text itself, you will want things like the time you accessed it, etc. The crawler needs to be smart enough to know how often to hit certain domains, to obey the robots.txt convention, etc.
The parser. This reads the data fetched by the crawler, parses it, saves whatever metadata it needs to, throws away junk, and possibly makes suggestions to the crawler on what to fetch next time around.
The indexer. Reads the stuff the parser parsed, and creates inverted indexes into the terms found on the webpages. It can be as smart as you want it to be -- apply NLP techniques to make indexes of concepts, cross-link things, throw in synonyms, etc.
The ranking engine. Given a few thousand URLs matching "apple", how do you decide which result is the best? Just the index doesn't give you that information. You need to analyze the text, the linking structure, and whatever other pieces you want to look at, and create some scores. This may be done completely on the fly (that's really hard), or based on some pre-computed notions of "experts" (see PageRank, etc).
The front end. Something needs to receive user queries, hit the central engine, and respond; this something needs to be smart about caching results, possibly mixing in results from other sources, etc. It has its own set of problems.
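To make the crawler's robots.txt check a bit more concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler name, the sample robots.txt content and the example.com URLs are all made up for illustration, and a real crawler would download the file from each site rather than parse a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical crawler name

# Made-up robots.txt content; a real crawler fetches this from the site itself.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser holding the site's rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, url, user_agent=USER_AGENT):
    """True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


rules = load_rules(SAMPLE_ROBOTS_TXT)
print(may_fetch(rules, "http://example.com/index.html"))
print(may_fetch(rules, "http://example.com/private/data.html"))
print(rules.crawl_delay(USER_AGENT))  # seconds to wait between requests, if declared
```

Deciding how often to hit a domain is a separate problem; the Crawl-delay value, when a site declares one, is only one possible input to that decision.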
My advice -- choose which of these interests you the most, download Lucene or Xapian or any other open source project out there, pull out the bit that does one of the above tasks, and try to replace it. Hopefully, with something better :-).
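If the indexer is the piece you pick, a toy version is small enough to write from scratch before you look at how Lucene or Xapian do it. The sketch below is an assumption-laden illustration, not how any of those projects work internally: the tokenizer is a bare regular expression, the three documents are invented, and the search only does a boolean AND over terms with no ranking.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every term in the query (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented sample documents.
docs = {
    1: "Apple pie recipe",
    2: "Apple releases new laptop",
    3: "Banana bread recipe",
}
index = build_inverted_index(docs)
print(search(index, "apple"))
print(search(index, "apple recipe"))
```

Everything the list above calls "smart" (synonyms, concepts, cross-links) would be layered on top of a structure like this, and deciding the order of the matching documents is the ranking engine's job.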
Some links that may prove useful:
"Agile web-crawler", a paper from Estonia (in English)
Sphinx Search engine, an indexing and search api. Designed for large DBs, but modular and open-ended.
"Information Retrieval", a textbook about IR from Manning et al. Good overview of how the indexes are built, various issues that come up, as well as some discussion of crawling, etc. Free online version (for now)!

Xapian is another option for you. I've heard it scales better than some implementations of Lucene.

Check out Nutch; it's written by the same guy who created Lucene (Doug Cutting).

It seems to me that the biggest part is the indexing of sites: making bots to scour the internet and parse their contents.
A friend and I were talking about how amazing Google and other search engines have to be under the hood. Millions of results in under half a second? Crazy. I think that they might have preset search results for commonly searched items.
edit:
This site looks rather interesting.

I would start with an existing project, such as the open source search engine from Wikia.
[My understanding is that the Wikia Search project has ended. However I think getting involved with an existing open-source project is a good way to ease into an undertaking of this size.]
http://re.search.wikia.com/about/get_involved.html

If you're interested in learning about the theory behind information retrieval and some of the technical details behind implementing search engines, I can recommend the book Managing Gigabytes by Ian Witten, Alistair Moffat and Tim C. Bell. (Disclosure: Alistair Moffat was my university supervisor.) Although it's a bit dated now (the first edition came out in 1994 and the second in 1999 -- what's so hard about managing gigabytes now?), the underlying theory is still sound and it's a great introduction to both indexing and the use of compression in indexing and retrieval systems.
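As a taste of what compression in an index can look like, here is a small sketch that stores a sorted list of document IDs as gaps between consecutive IDs, with each gap written in a variable number of bytes. This is one common style of scheme chosen for illustration; it is not claimed to be the specific coding the book recommends, and the function names and sample IDs are made up.

```python
def encode_postings(doc_ids):
    """Encode a sorted list of positive doc IDs as variable-byte-coded gaps."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous  # small when IDs are close together
        previous = doc_id
        # Emit 7 bits at a time, low-order bits first.
        # A set high bit means "more bytes of this gap follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Reverse encode_postings: rebuild the original sorted doc IDs."""
    doc_ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # this gap continues in the next byte
        else:
            current += gap  # gap complete: add it to the running doc ID
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids


postings = [3, 7, 300, 100000]  # made-up doc IDs
encoded = encode_postings(postings)
print(len(encoded), decode_postings(encoded))
```

The saving comes from the gaps: a term that appears in many documents has IDs that sit close together, so most gaps fit in a single byte instead of a fixed-width integer.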

I'm interested in search engines too. I recommend both Apache Hadoop MapReduce and Apache Lucene. Using a Hadoop cluster is the best way to make it faster.

There are ports of Lucene. Zend have one freely available. Have a look at this quick tutorial: http://devzone.zend.com/node/view/id/91

Here's a slightly different approach, if you are not so much interested in the programming of it but more interested in the results: consider building it using Google Custom Search Engine API.
Advantages:
Google does all the heavy lifting for you
Familiar UI and behavior for your users
Can have something up and running in minutes
Lots of customization capabilities
Disadvantages:
You're not writing code, so no learning opportunity there
Everything you want to search must be public & in the Google index already
Your result is tied to Google

UML standards guide / Best Practices [closed]

Closed 10 years ago.
Does anyone know of a decent UML standards guide?
My company currently relies on UML 2.0 (rightly or wrongly) to do the majority (read all) of their design work. I have been asked to come up with a draft 'best practice' guide to help other developers develop better models. The main problem I face is that I'm slightly biased against UML... I feel that if a diagram takes more than 5 mins to draw then it's too complicated! I'm looking for advice predominantly on what sort of standards I should be looking at. I'm also looking for an external source of information that can be used to balance out my irrational loathing of UML-heavy design and act as a 'sanitizer' for my suggestions.
Most of all I'm looking to write a useful document rather than one that will sit moulding away in some obscure network directory.
Any ideas?

UML Distilled by Martin Fowler

Like Paul C, I recommend UML Distilled. It is primarily about UML, but it contains a lot of insight about design in general (although it insists a bit too much on index cards IMO), it is short, pleasant to read, and to the point.
I strongly recommend against UML in a Nutshell. It is the worst O'Reilly book I have: insanely dense, hard to read and meandering. Not worth the paper it is printed on.

We are not talking about a book that says how to use UML, but rather a style or standards guide of some sort. Enter UML profiles... These can get you both the standardization and the reduced complexity you are looking for. You can limit the relationships and elements which can be used. You can also require certain things. A large company may choose to focus on the assets and data movement and limit its standardized diagrams to this view. However, a company making real-time software for tanks might focus on action or flow.
The whole point of UML is that it is not specific and is useful for every kind of situation. Martin Fowler and Elements of Style books will not reduce diagramming time and increase comprehension. You need standardized profiles or patterns for that. I have seen it work, to the point that the business can read them. Many tools allow you to create a profile, which eases the learning curve for the designers and reduces drawing time.
MDA Distilled (OMG Press) is a good book if you want to understand the concepts, but it is not needed.
Really, UML Profiles. You don't want a standard because your company or your need is different. A standard for Web Services does not work for real-time or financial services.

Buy everyone a copy of The Elements of UML 2.0 Style. Job done.

For a quick reference on how to compose individual UML diagrams, I heartily recommend The Elements of UML 2.0 Style, and I put my money where my recommendation lies by purchasing the 2nd edition to replace my 1st ed.
Apart from this recommendation, I think the most important thing in a company when introducing any style guide is to have a local feedback mechanism where people can post comments on which aspects of the style guide work for them, especially when you're using an official printed guide. A wiki or similar casual repository should suffice for this.
I also suggest highlighting diagrams which were particularly good examples (or bad ones, if the team humor could take it). Consider a framed Diagram of the Week like the Employee of the Week you see in so many stores. That gives a gentle reminder that diagram readability is taken seriously but hopefully with enough fun to get more buy-in to the concept.

I know you probably want an easy-to-read book for this, but from what you are describing I would suggest going with the specs found on OMG itself. They are a bit much to read but would be as complete as you could hope for. They also have links to articles and tutorials that may be helpful.
As far as books go I have found that Using UML is quite good since it tackles the software development process as well as the UML tools and methods.
