Does anyone know where decent documentation describing the Lucene index format IN DETAIL on the web is? - search

I am mainly curious as to the inner workings of the engine itself. I couldnt find anything about the index format itself (IE in detail as though you were going to build your own compatible implementation) and how it works. I have poked through the code, but its a little large to swallow for what must be described somewhere since there are so many compatible ports to other languages around. Can anyone provide a decent link?

Have you seen this: http://lucene.apache.org/java/2_4_0/fileformats.html? It's the most detailed I've found.
Although Lucene in Action does stop short of the detail in that link, I found it a useful companion to keep a handle on the big picture concepts while understanding the nitty gritty.

Related

Learning Viewflow and Documentation

Viewflow sounds like a really useful library, but I've spent two days reading everything I can about it and still don't have a good idea of how to use it.
The current documentation seems to cover
The problems that viewflow is intended to solve
Some very basic examples
In-depth details of its various components.
What seems to be missing is the area between two and three. So I'm looking for something that gives a certain amount of detail and explains viewflow's overall structure while also explaining how things are being done and how one might do something differently if one wanted to.
Does anyone know of any tutorials or documentation that could help to fill this gap? Viewflow certainly sounds great but at the moment, I'm a long way from finding it usable.

Semantics based code search

We have a large number of repositories. We want to implement a semantics(functionality) based code search on those repositories. Right now, we already have implemented keyword based code search in which we crawled through all the repository files and indexed them using elasticsearch. But that doesn't solve our problem as some of the repositories are poorly commented and documented, thus searching for specific codes/libraries become difficult.
So my question is: Is there any opensource libraries or any previous work done in this field which could help us index the semantics of the repository files, so that searching the code becomes easy and this would also help us in re-usability of the codes. I have found some research papers like Semantic code browsing, Semantics-based code search etc. but were of no use as there was no actual implementation given. So can you please suggest some good libraries or projects which could help me in achieving the same.
P.S:-Moreover, companies like Koders, Google, cocycles.com etc. started their code search based on functionality. But most of them have shut down their operations without giving any proper feedback, can anyone please tell me what kind of difficulties they are facing.
not sure if this is what you're looking for, but I wrote https://github.com/google/zoekt , which uses ctags-based understanding of code to improve ranking.
Take a look at insight.io
It provides semantic search and browsing

Is there documentation that describes how each part of the Inno Setup WizardForm is connected?

I am having a very difficult time finding any documentation about how the various parts of the Inno Setup Wizardform are hooked together.
Using various answers here on StackOVerflow I have gathered some information but most of the time these are inferences rather and I'm not confident I really have a good grasp.
I have looked at the online help available at http://www.jrsoftware.org/ishelp/index.php and specifically the section on Support Classes Reference but it just gives me a list of all the parts of the wizard form. I really would like more information about how things like the InstallingPage and the InnerPage relate. I've looked through the listing of topics as well and nothing appears to relate to the question I have.
I'm just having a very difficult time grasping how all of those various parts hook together and where each part is in a hierarchy either visual or logical.
I guess I could go look at the source code for Inno Setup but I thought it'd be worth asking here first before diving into an unfamiliar code base in a language I've only been poking at for two days. If that is my only recourse, I guess I'll have to.
I would like more information about how things like the InstallingPage and the InnerPage relate.
No, WizardForm internals are not documented.
You might read the source.
A code search reveals that the search terms do not occur that often.
https://github.com/jrsoftware/issrc/search?q=InnerPage
https://github.com/jrsoftware/issrc/search?q=InstallingPage
This is the source for the Wizard itself and it's form.
The pages are hooked together via RegisterExistingPage().

Learning KML, but finding inconsistencies in specs

I need to dive into KML, so I'm looking at specs and examples on the web, but I get some inconsistent information.
For instance, some sites say you write things like <tessellate>true</tessellate> while others have <tessellate>1</tessellate>, and yet others say tessellate isn't necessary any more, since altitudeMode provides the same information.
So I've got a lot of little questions, such as which is safer here, 1/0 or true/false. And what is KML the abbreviation of, Keyhole Markup Language or Keyhole Modelling Language?
But my question basically is, where do I find the standard, the authoritative and definitive reference for KML? Then I can ignore the other sources.
http://code.google.com/apis/kml/documentation/kmlreference.html
Should probably be considered the primary reference. That's the one I always refer to.
It include Google specific extensions (prefixed by gx:) which you can choose to ignore (or not).
Although
http://www.opengeospatial.org/standards/kml/
is the real definitive source, its nowhere near as accessible as the Google one (which preceded OGC taking on KML) - I dont even know how to read the OGC one!
KML used to represent "Keyhole Markup Language"
Boolean, is probably best considered 1/0 - some clients might understand true etc, but for maximum compatiblity go with 1/0
btw, tessellate is useful in itself, it does not do the same as altitudeMode. They complement each other.

Effective strategies for studying frameworks/ libraries partially [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I remember the old effective approach of studying a new framework. It was always the best way to read a good book on the subject, say MFC. When I tried to skip a lot of material to speed up coding it turned out later that it would be quicker to read the whole book first. There was no good ways to study a framework in small parts. Or at least I did not see them then.
The last years a lot of new things happened: improved search results from Google, programming blogs, much more people involved in Internet discussions, a lot of open source frameworks.
Right now when we write software we much often depend on third-party (usually open source) frameworks/ libraries. And a lot of times we need to know only a small amount of their functionality to use them. It's just about finding the simplest way of using a small subset of the library without unnecessary pessimizations.
What do you do to study as less as possible of the framework and still use it effectively?
For example, suppose you need to index a set of documents with Lucene. And you need to highlight search snippets. You don't care about stemmers, storing the index in one file vs. multiple files, fuzzy queries and a lot of other stuff that is going to occupy your brain if you study Lucene in depth.
So what are your strategies, approaches, tricks to save your time?
I will enumerate what I would do, though I feel that my process can be improved.
Search "lucene tutorial", "lucene highlight example" and so on. Try to estimate trust score of unofficial articles ( blog posts ) based on publishing date, the number and the tone of the comments. If there is no a definite answer - collect new search keywords and links on the target.
Search for really quick tutorials/ newbie guides on official site
Estimate how valuable are javadocs for a newbie. (Read Lucene highlight package summary)
Search for simple examples that come with a library, related to what you need. ( Study "src/demo/org/apache/lucene/demo")
Ask about "simple Lucene search highlighting example" in Lucene mail list. You can get no answer or even get a bad reputation if you ask a silly question. And often you don't know whether you question is silly because you have not studied the framework in depth.
Ask it on Stackoverflow or other QA service "could you give me a working example of search keywords highlighting in Lucene". However this question is very specific and can gain no answers or a bad score.
Estimate how easy to get the answer from the framework code if it's open sourced.
What are your study/ search routes? Write them in priority order if possible.
I use a three phase technique for evaluating APIs.
1) Discovery - In this phase I search StackOverflow, CodeProject, Google and Newsgroups with as many different combination of search phrases as possible and add everything that might fit my needs into a huge list.
2) Filter/Sort - For each item I found in my gathering phase I try to find out if it suits my needs. To do this I jump right into the API documentation and make sure it has all of the features I need. The results of this go into a weighted list with the best solutions at the top and all of the cruft filtered out.
3) Prototype - I take the top few contenders and try to do a small implementation hitting all of the important features. Whatever fits the project best here wins. If for some reason an issue comes up with the best choice during implementation, it's possible to fall back on other implementations.
Of course, a huge number of factors go into choosing the best API for the project. Some important ones:
How much will this increase the size of my distribution?
How well does the API fit with the style of my existing code?
Does it have high quality/any documentation?
Is it used by a lot of people?
How active is the community?
How active is the development team?
How responsive is the development team to bug patch requests?
Will the development team accept my patches?
Can I extend it to fit my needs?
How expensive will it be to implement overall?
... And of course many more. It's all very project dependent.
As to saving time, I would say trying to save too much here will just come back to bite you later. The time put into selecting a good library is at least as important as the time spent implementing it. Also, think down the road, in six months would you rather be happily coding or would you rather be arguing with a xenophobic dev team :). Spending a couple of extra days now doing a thorough evaluation of your choices can save a lot of pain later.
The answer to your question depends on where you fall on the continuum of generality/specificity. Do you want to solve an immediate problem? Are you looking to develop a deep understanding of the library? Chances are you’re somewhere between those extremes. Jeff Atwood has a post about how programmers move between these levels, based on their need.
When first getting started, read something on the high-level design of the framework or library (or language, or whatever technology it is), preferably by one of the designers. Try to determine what problems they are trying to address, what the organizing principles behind the design are, and what the central features are. This will form the conceptual framework from which future understanding will hang.
Now jump in to it. Create something. Do not copy and paste somebody's code. Instead, when things don’t work, read the error messages in detail, and the help on those error messages, and figure out why that error occurred. It can be frustrating, when things don’t work, but it forces you to think, and that’s when you learn.
1) Search Google for my task
2) look at examples with a few different libraries, no need to tie myself down to Lucene for example, if I don't know what other options I have.
3) Look at the date of last update on the main page, if it hasn't been updated in 6-months leave (with some exceptions)
4) Search for sample task with library (don't read tutorials yet)
5) Can I understand what's going on without a tutorial? If yes continue if no start back at 1
6) Try to implement the task
7) Watch myself fail
8) Read a tutorial
9) Try to implement the task
10) Watch myself fail and ask on StackOverflow, or mail the authors, post on user group (if friendly looking)
11) If I could get the task done, I'll consider the framework worthy of study and read up the main tutorial for 2 hours (if it doesn't fit in 2 hours I just ignore what's left until I need it)
I have no recipe, in the sense of a set of steps I always follow, that's largely because everything I learn is different. Some things are radically new to me (Dojo for example, I have no fluency in Java script so that's a big task), some just enhancements of previous knowledge (Iknow EJB 2 well, so learning EJB 3 while on the surface is new with all its annotations, its building on concepts.)
My general strategy though is I'd describe as "Spiral and Park". I try to circle the landscape first, understand the general shape, I Park concepts that I don't get just yet, don't let it worry me. Then i go a little deeper into some areas, but again try not to get obsessed with one, Spiralling down into the subject. Hopefully I start to unpark and understand, but also need to park more things.
Initially I want answers to questions such as:
What's it for?
Why would I use this rather than that other thing I already know
What's possible? Any interesting sweet spots. "Eg. ooh look at that nice AJAX-driven update"
I do a great deal of skim reading.
Then I want to do more exploring on the hows. I start to look for gotchas and good advice. (Eg. in java: why is "wibble".equals(var) a useful construct?)
Specific techniques and information sources:
Most important: doing! As early as possible I want to work a tutorial or two. I probably have to get the first circuit of the spiral done, but then I want to touch and experiment.
Overview documents
Product documents
Forums and discussion groups, learning by answering questions is my favourite technique.
if at all possible I try to find gurus. I'm fortunate in having in my immediate colleagues a wealth of knowledge and experience.
Quick-start guides.
A quick look at the API documentation if available.
Reading sample codes.
Messing around YOU HAVE TO MESS AROUND (sorry for the caps).
If it's a small library/API with a small or no community you can always contact the developer himself and ask for help 'cause he'll probably be more than happy to help you; he's happy that one more person is using his API.
Mailing lists are a great resource as long as you do your homework first before asking questions.
Mailing list archives are invaluable for most of the questions I've had on CoreAudio related stuff.
I would never read javadoc. As there often is none. And when there is, most likely it isnt up to date. So one gets confused at the best.
Start with the simplest possible tutorial you find within some minutes.
Often the tutorial will lead you to further sources at the end, so then most of the time one is on a path that goes on and on, deeper and deeper.
It really depends on what the topic is and how much info is on it. Learning by example is a good way to start a topic brand new to you, especially if you're knowledgeable in other similar libraries or languages. You can take a topic you're familiar with, and say "I understand how to implement using X, lets see how it's done using Y".
So what are your strategies, approaches, tricks to save your time?
Well, I search. I generally never ask questions, preferring to research myself. If worse comes to worse I'll read the documentation. In some cases (say, when I was doing some work with SharpSVN) I had to look at the source, specifically the test cases, to get some information about how the API worked.
Generally, I have to be honest, most of my 'study' and 'learning' is by accident.
For example, just a few seconds ago, I discovered how to get the "Recent" folder in C#. I had no idea how to do that before seeing the question, considering it interesting, and then searching.
So for me the real 'trick' is that I hang around on forums, answer questions, and accidentally pick up knowledge. Then when it comes time for me to research something; chances are I know a bit about it, and searching is easier and I can focus on the implementation [typically implementing a test program first] and progressing from there.

Resources