Lucene Near Realtime Search - search

I would like to improve the speed of my search system, by using some features like the "Near realtime search", making faster the opening of an index.
Who can tell me what are the differences and advantages of using one or the other of the following pieces of code:
IndexReader newIndexReader = IndexReader.openIfChanged(oldReader);
and
IndexWriter writer; // create an IndexWriter here
...
IndexReader reader = writer.getReader();
Note that in the first case i don't need that the indexWriter is in the same process of the IndexReader.

First of all, writer.getReader() is only available in 3.6 release but absent in all v4 (4.0, 4.1, ...) releases. So you shouldn't be using it if you care about maintainability.
Now, to answer your question. writer.getReader() flushes all pending writes and unconditionally opens a new reader (plus there are some other limitations - see method javadoc). IndexReader.openIfChanged(oldReader) only opens a new reader if there were any changes made, otherwise returns old reader.
To my taste, opening a reader from writer is also conceptually wrong (there used to be other cases in Lucene having this problem, e.g. it used to be possible to delete documents using reader).
If you are choosing between the two, I think there's no doubt which method to use.

Related

F# update list in multiple threads at same time

I am new to F# so maybe the solution can be clear to someone, but I can not find it.
Imagine a game world of world chunks (similar to Minecraft), but for more players.
In theory language like C++, java, or C# can modify multiple chunks of world at same time. Two or more players try to place or remove block in different chunks and all these actions can change the state of the world without affecting each other as long as no more than one action in each chunk is happening. Serializing will only happen when multiple players in one chunk perform the modification.
My understanding of F# is I need to serialize these actions on global level and no two actions can happen in same time in entire world, because the update function need actual world state update params(like add/remove blok) and return new world state.
For that example the world state contains chunk list.
Is there a way to do world update in parallel?
Can the world state be stored differently to allow update to multiple chunks at same time?
It sounds like you need to ensure that each chunk has one action run at a time. You can protect pieces of state by storing them inside mailbox processors (often referred to as just "agents"). You can send agents several messages from several threads. They will be queued and processed one at a time.
There is a detailed discussion of this here: https://fsharpforfunandprofit.com/posts/concurrency-actor-model/
Firstly, I don't this really adds any technical detail to a previous answer, so you if you like their solution you should go ahead and mark that as the answer. However, I hope this gives some extra context...
Underlying your problem is the question of how consistent do you require the state of your world to be in order to make decisions about modifying chunks.
Consider a world where I have two chunks, let's call them A and B. Consider the use case where I want to add or remove a block from chunk A. The all important question is:
Do I need to know about the blocks in chunk B in order to validate, and then perform the addition / removal of a block from chunk A.
For example, if I only have finite number of blocks in my world, I may well need this information to validate that I can actually add a block without going over my limit. The key here is that my "consistency boundary" is my entire world - in order to perform the addition of a new block to chunk A I need consistent information about everythign in my world. It is no good if halfway through my decision making another thread jumps in and adds a block to chunk B. If this is a requirement then you have no option - even in the C#/C++ case - you need to lock down access to your world so only one such action can be performed at anyone time.
From the way you phrase the question, I suspect this is not the case. In which case we need to examine exactly what your consistency requirements are. A weaker requirement is that if I'm adding blocks to chunk A, I at least have to have consistent information about the number (and position) of blocks in chunk A. In the C#/C++ case this would mean having to put locks around accessing to individual "chunk data", but not the whole world.
A simple way of modelling this in F# would be (using the suggestion in this answer):
open FSharp.Core
type ChunkMessage =
AddBlock
| RemoveBlock
type MyWorld =
{
Blocks : List<MailboxProcessor<ChunkMessage>>
}
Note that MyWorld is a mutable, but each MailboxProcessor encapsulates state which can only change through processing one message at a time.
The implementation of Blocks doesn't have to be a list of MailboxProcessor's, you could use a thread-safe collection of objects for which you had thread-safe methods on, but the use of them here as suggested by The Quick Brown Fox leads to a particularly nice programming model.

DDD / Aggregate Root / Versioning

How do we usually deal with versioning of an aggregate root?
I was thinking along this line (I'm in a survey-design domain).
One way to have versioning is to have an explicit method to create a new version, based on the existing one. For example, Study (an aggregate root).
So initially we have an aggregate root, whose root-entity is Study with (business) key "ABC", version "1".
By invoking the method "newVersion()" on the Study, a copy of that Study and all the other entities that belong to the same aggregate root will be created.
So basically, versioning is done through creation a separate instance (of aggregate root). The ID is composite (business key + version).
How do we know if it's a branch? or is it just one version up? (1.1? or 2). I guess, this simple rule would work: if there's no further version associated, then it's "one version up" (2); if there's already another version, than it's a branch (1.1).
Another concern: noise.
But that means, we cannot work on / modify existing version. We'd have to create a newVersion everytime we want to make modifications to our object. Everytime??? Hmmm.... Doesn't sound right.
Or... we can make rule like this, based on a flag (active / not-active, or published / un-published). If the flag is "not-active", we can modify the AR directly, without creating a new version. If the flag is active we have to either: (a) set it to "not-active" first, and modify.... or (b) create a newVersion and work on the version (initially set to "not-active").
Any thoughts / experience you want to share on this matter?
I think you will find things a bit confusing in researching this question, because there are two very different concepts at play:
Versioning as a concurrency control mechanism to support optimistic concurrency
Versioning as an explicit domain concept
Versioning to support Optimistic Concurrency
Optimistic concurrency is when two simultaneous transactions are allowed to start, but if they both try and modify the same data item, only the first one is permitted to proceed. See Concurrency Control for an overview of different locking strategies.
In summary, you leave versioning up to the persistence technology, because the purpose of the version is to detect simultaneous writes to the persistence layer.
When using this pattern, it's common to not even keep copies of old versions, however it's certainly possible to do so as an audit trail/change log.
Versioning as an explicit domain concept
Based on your question, and the need to support potential branching strategies, it sounds like versioning is an explicit domain concept in your domain - i.e. the concept of a "Version" is something that your domain experts talk about, and working with versions is an important part of the ubiquitous language.
However, you raise a few different concepts which indicate that the domain needs further exploration:
Version branching
User-defined version naming/tagging (but still connected to a 'chain' of versions)
Explicit version changes (user requested) vs implicit version changes (automatic on every change)
If I understand your intent correctly, with explicit versioning, the current 'active'/'live'/'tip' version is mutable and can be modified without tracking the change, until the user 'commits' it - it becomes immutable, and a new 'live' version that is mutable is created.
Some other concepts that may come up if you explore this version:
Branch merging (once you have split two branches, what happens if you want to bring them back together?)
Rolling back - if you have an old version, do you support 'undoing' one or more changes?
Given the above, you may also find some insights from the way that version control systems work both centralised (e.g. subversion) and distributed (e.g. git and mercurial), as they present an active working model of version tracking with a mixture of mutable and immutable elements.
The open questions here suggest to me that you need to explore this in more detail with your domain experts. With DDD sometimes it's easy to get lost in what you can do, but I strongly encourage you to try and understand what you need to do.
How do your users/domain experts think about the world? What kind of operations do they want to be able to do? What is the purpose of these operations towards their initial goal? Your aim is to distill the answers to these questions into a model that effectively encapsulates the processes they work with.
Edit to Consider Modelling
Based on your comment - my first response would be to challenge the interpretation of the word 'version' when thinking about the modified questionnaire. In fact, I'd be tempted to challenge the modelling of the template/survey relationship. Consider a possible set of entities:
Template
Defines the set of questions in the questionnaire
Supports operations:
StartSurvey
Various operations to modify the questions and options in the template etc.
Survey
Rather than referencing a 'live' template, the survey would own it's own questionnaire
When you call Template.StartSurvey it returns a Survey that is prefilled with the list of questions from the template
A survey also supports modifying the questions - but this doesn't change the template it was created from
Unlike a template, a survey also maintains a list of recorded answers, and offers operations to set the answers
It probably also includes a lifecycle state wherein in some states answering questions is permitted, but once 'submitted' you can't modify the answers (just guessing on this one).
In this world, the survey is 'stamped out' from the template, but then lives an independent life. You can modify the questionnaire in the survey all you like, and it won't effect the template.
The trade-off here is that if you do modify the template, none of the surveys that have already been created from it would get updated - but it sounds like that might be safer for you anyway?
You could also support operations to convert a survey back into a template so that if you like the look of a modified survey, you could 'templatize' it so it could be used for future surveys.

SWI-Prolog stream I/O and tab completion in the swipl-window

What I'm Doing
I am currently working on creating a SWI-Prolog module that adds tab-completion capability to the swipl-win window. So far I've actually gotten it to where it reads a single character at a time without stopping/returning anything until a tab character is typed. I have also already written a predicate that returns all possible completions of an incompletely typed term by using substring-matching on a list of current terms (obtained via current_functor/2, current_arithmetic_function/1, current_predicate/2, etc [the predicate used will eventually be based off of context]).
If you want to see my code, it is here...just keep in mind that I'm not exactly a Prolog master yet (friendly tips are more than welcome).
What I'm Thinking
I realize that when I actually implement my main completion predicate (still unwritten), I'll have to figure out what the last "word" is in the input stream. I'm debating on whether I should create a new stream with everything in the input stream so far (so I don't have to change the position in the input stream/go back to the beginning) or write to a string...if I take the second approach, I'll start over on the string whenever a delimiting character is inputted (characters that start a new "word", like space, comma, parentheses, operators, etc.) so there won't be any searching through the stream every time tab is pressed.
However, there is another thing: When the user is navigating through and modifying a typed but not-yet-submitted query (via arrow keys and backspace and such), a separate stream is necessary to handle mid-stream completion. A string will do just fine if completion is requested at the end of a stream (handling backspace is as easy as lopping off the last character of the string), but since the string would only contain the current "word", tabber.pl would be at a loss in instances like that. Unless, of course, the current-word string would update and find the current word that the cursor is in as the user navigated and typed mid-stream... (could I use at_end_of_stream(Stream) for that?)
What I'm Asking
How do you think I ought to approach this (string or stream)? The store-to-string method and the make-a-new-stream way both sound like they each have their advantages, so I'm pretty sure the solution will be some sort of combination of both. Any ideas, corrections, or suggestions on accomplishing my goal? (pun intended)
In order to figure that out and really do this correctly, I think I'll also have to know how SWI-Prolog use the input and output streams in the swipl-win window. (It's obviously accepting input, but does it use the output stream to write to the window as you type [into the input stream]?)
Getting this done without changing the C code underlying the swipl-win.exe console will be hard. This also relates to a thread on the mailing list starting here. The completion caller is in src/pl-ntmain.c, do_complete()
for Windows and src/os/pl-rl.c, prolog_completion() for the GNU readline based completion used on Unix systems.
The first step to make is lead these two and the upcoming one described in the referenced thread back
to Prolog using a callback. That requires a small study of the design of the completion interfaces to arrive at a suitable Prolog callback. I guess that should pass in some representation of the entire line and the caret location and return a list of completions from the caret. With that, anyone can write their own smart completer.

The best IR software for my use?

I want to take what people chat about in a chat room and do the following information retrieval:
Get the keywords
Ignore all noise words, keep verb an nouns mainly
Perform stemming on the keywords so that I don't store the same keyword in many forms
If a synonym keyword is already stored in my storage then the existing synonym should be used instead of the new keyword
Store the processed keyword in a persistant storage with a reference to the chat message it was located in and the user who uttered it
With this prosessed information I want to slowly get an idea of what people are talking about in chatrooms, and then use this to automatically find related chatrooms etc. based on these keywords.
My question to you is a follows: What is the best C/C++ or .NET tools for doing the above?
I partially agree with #larsmans comment. Your question, in practice, may indeed be more complex than the question you posted.
However, simplifying the question/problem, I guess the answer to your question could be one of Lucene's implementation: Lucene (Java), Lucene.Net (C#) or CLucene (C++).
Following the points in your question:
Lucene would take care of point 1 by using String tokenizers (you can customize or use your own).
For point 2 you could use a TokenFilter like StopFilter so Lucene can read a list of stopwords ("the", "a", "an"...) that it should not use.
For point 3 you could use PorterStemFilter.
Point 4 is a little bit trickier, but could be done using a customized TokenFilter.
Point 1 to 4 are perfomed in the Analysis/tokenization phase, which an Analyzer is responsible.
Regarding point 5, in Lucene you can store Documents with fields. A document can have an arbitrary number and mix of fields. So you could create a single Document for each chat room with all its text concatenated, and have another field of the document reference the chatroom it was extracted from. You will end up with a bunch of Lucene documents that you can compare. So you can compare your current chat room with others to see which one is more similar to the one you are on.
If all you want is a set of the best keywords to describe a chatrom your needs are closer to information extraction/automatic summarization/topic spotting task as #larsmans said. But you can still use Lucene for the parsing/tokenization phase.
*I referenced the Java docs, but CLucene and Lucene.Net have very similar APIs so it won't be much trouble to figure out the differences.

Meaning of Leaky Abstraction?

What does the term "Leaky Abstraction" mean? (Please explain with examples. I often have a hard time grokking a mere theory.)
Here's a meatspace example:
Automobiles have abstractions for drivers. In its purest form, there's a steering wheel, accelerator and brake. This abstraction hides a lot of detail about what's under the hood: engine, cams, timing belt, spark plugs, radiator, etc.
The neat thing about this abstraction is that we can replace parts of the implementation with improved parts without retraining the user. Let's say we replace the distributor cap with electronic ignition, and we replace the fixed cam with a variable cam. These changes improve performance but the user still steers with the wheel and uses the pedals to start and stop.
It's actually quite remarkable... a 16 year old or an 80 year old can operate this complicated piece of machinery without really knowing much about how it works inside!
But there are leaks. The transmission is a small leak. In an automatic transmission you can feel the car lose power for a moment as it switches gears, whereas in CVT you feel smooth torque all the way up.
There are bigger leaks, too. If you rev the engine too fast, you may do damage to it. If the engine block is too cold, the car may not start or it may have poor performance. And if you crank the radio, headlights, and AC all at the same time, you'll see your gas mileage go down.
It simply means that your abstraction exposes some of the implementation details, or that you need to be aware of the implementation details when using the abstraction. The term is attributed to Joel Spolsky, circa 2002. See the wikipedia article for more information.
A classic example are network libraries that allow you to treat remote files as local. The developer using this abstraction must be aware that network problems may cause this to fail in ways that local files do not. You then need to develop code to handle specifically errors outside the abstraction that the network library provides.
Wikipedia has a pretty good definition for this
A leaky abstraction refers to any implemented abstraction, intended to reduce (or hide) complexity, where the underlying details are not completely hidden
Or in other words for software it's when you can observe implementation details of a feature via limitations or side effects in the program.
A quick example would be C# / VB.Net closures and their inability to capture ref / out parameters. The reason they cannot be captured is due to an implementation detail of how the lifting process occurs. This is not to say though that there is a better way of doing this.
Here's an example familiar to .NET developers: ASP.NET's Page class attempts to hide the details of HTTP operations, particularly the management of form data, so that developers don't have to deal with posted values (because it automatically maps form values to server controls).
But if you wander beyond the most basic usage scenarios the Page abstraction begins to leak and it becomes hard to work with pages unless you understand the class' implementation details.
One common example is dynamically adding controls to a page - the value of dynamically-added controls won't be mapped for you unless you add them at just the right time: before the underlying engine maps the incoming form values to the appropriate controls. When you have to learn that, the abstraction has leaked.
Well, in a way it is a purely theoretical thing, though not unimportant.
We use abstractions to make things easier to comprehend. I may operate on a string class in some language to hide the fact that I'm dealing with an ordered set of characters that are individual items. I deal with an ordered set of characters to hide the fact that I'm dealing with numbers. I deal with numbers to hide the fact that I'm dealing with 1s and 0s.
A leaky abstraction is one that doesn't hide the details its meant to hide. If call string.Length on a 5-character string in Java or .NET I could get any answer from 5 to 10, because of implementation details where what those languages call characters are really UTF-16 data-points which can represent either 1 or .5 of a character. The abstraction has leaked. Not leaking it though means that finding the length would either require more storage space (to store the real length) or change from being O(1) to O(n) (to work out what the real length is). If I care about the real answer (often you don't really) you need to work on the knowledge of what is really going on.
More debatable cases happen with cases like where a method or property lets you get in at the inner workings, whether they are abstraction leaks, or well-defined ways to move to a lower level of abstraction, can sometimes be a matter people disagree on.
I'll continue in the vein of giving examples by using RPC.
In the ideal world of RPC, a remote procedure call should look like a local procedure call (or so the story goes). It should be completely transparent to the programmer such that when they call SomeObject.someFunction() they have no idea if SomeObject (or just someFunction for that matter) are locally stored and executed or remotely stored and executed. The theory goes that this makes programming simpler.
The reality is different because there's a HUGE difference between making a local function call (even if you're using the world's slowest interpreted language) and:
calling through a proxy object
serializing your parameters
making a network connection (if not already established)
transmitting the data to the remote proxy
having the remote proxy restore the data and call the remote function on your behalf
serializing the return value(s)
transmitting the return values to the local proxy
reassembling the serialized data
returning the response from the remote function
In time alone that's about three orders (or more!) of magnitude difference. Those three+ orders of magnitude are going to make a huge difference in performance that will make your abstraction of a procedure call leak rather obviously the first time you mistakenly treat an RPC as a real function call. Further a real function call, barring serious problems in your code, will have very few failure points outside of implementation bugs. An RPC call has all of the following possible problems that will get slathered on as failure cases over and above what you'd expect from a regular local call:
you might not be able to instantiate your local proxy
you might not be able to instantiate your remote proxy
the proxies may not be able to connect
the parameters you send may not make it intact or at all
the return value the remote sends may not make it intact or at all
So now your RPC call which is "just like a local function call" has a whole buttload of extra failure conditions you don't have to contend with when doing local function calls. The abstraction has leaked again, even harder.
In the end RPC is a bad abstraction because it leaks like a sieve at every level -- when successful and when failing both.
What is abstraction?
Abstraction is a way of simplifying the world.
It means you don't have to worry about what is actually happening under the hood.
Example: Flying a 737/747 is "abstracted" away
Planes are complicated systems: it involves: jet engines, oxygen systems, electrical systems, landing gear systems etc.
...but the pilot doesn't have to worry about it... all of that is "abstracted away". The only thing a pilot needs to focus on is yoke (i.e. steering wheel of the plane).
He pushes the yoke left to go left, and right to go right etc.
....that is in an ideal world. In reality, flying a plane is much more complicated. Because many details ARE NOT "abstracted away".
Leaky Abstractions in 737 Example
Pilots in reality have to worry about a LOT of things: wind speed, thrust, angles of attack, fuel, altitude, weather problems, angles of descent. Computers can help the pilot in these tasks, but not everything is automated / simplified......not everything is "abstracted away".
e.g. If the pilot pulls up too hard on the column - the plane will obey, but then the plane might stall, and that's really bad.
In other words, it is not enough for the pilot to simply control the steering wheel without knowing anything else.........nooooo.......the pilot must know about the underlying risks and limitations of the plane before the pilot flies one.......the pilot must know how the plane works, and how the plane flies; the pilot must know implementation details..... that pulling up too hard will lead to a stall, or that landing too steeply will destroy the plane etc.
Those things are not abstracted away. A lot of things are abstracted, but not everything. The abstraction is "leaky".
Leaky Abstractions in Code
......it's the same thing in your code. If you don't know the underlying implementation details, then you're gonna have problems.
ORMs abstract a lot of the hassle in dealing with database queries, but if you've ever done something like:
User.all.each do |user|
puts user.name # let's print each user's name
end
Then you will realise that's a nice way to kill your app. You need to know that calling User.allwith 25 million users is going to spike your memory usage, and is going to cause problems. You need to know some underlying details. The abstraction is leaky.
An example in the django ORM many-to-many example:
Notice in the Sample API Usage that you need to .save() the base Article object a1 before you can add Publication objects to the many-to-many attribute. And notice that updating the many-to-many attribute saves to the underlying database immediately, whereas updating a singular attribute is not reflected in the db until the .save() is called.
The abstraction is that we are working with an object graph, where single-value attributes and mult-value attributes are just attributes. But the implementation as a relational database backed data store leaks... as the integrity system of the RDBS appears through the thin veneer of an object interface.
The fact that at some point, which will guided by your scale and execution, you will be needed to get familiar with the implementation details of your abstraction framework in order to understand why it behave that way it behave.
For example, consider this SQL query:
SELECT id, first_name, last_name, age, subject FROM student_details;
And its alternative:
SELECT * FROM student_details;
Now, they do look like a logically equivalent solutions, but the performance of the first one is better due the individual column names specification.
It's a trivial example but eventually it comes back to Joel Spolsky quote:
All non-trivial abstractions, to some degree, are leaky.
At some point, when you will reach a certain scale in your operation, you will want to optimize the way your DB (SQL) works. To do it, you will need to know the way relational databases works. It was abstracted to you in the beginning, but it's leaky. You need to learn it at some point.
Assume, we have the following code in a library:
Object[] fetchDeviceColorAndModel(String serialNumberOfDevice)
{
//fetch Device Color and Device Model from DB.
//create new Object[] and set 0th field with color and 1st field with model value.
}
When the consumer calls the API, they get an Object[]. The consumer has to understand that the first field of the object array has color value and second field is the model value. Here the abstraction has leaked from library to the consumer code.
One of the solutions is to return an object which encapsulates Model and Color of the Device. The consumer can call that object to get the model and color value.
DeviceColorAndModel fetchDeviceColorAndModel(String serialNumberOfTheDevice)
{
//fetch Device Color and Device Model from DB.
return new DeviceColorAndModel(color, model);
}
Leaky abstraction is all about encapsulating state. very simple example of leaky abstraction:
$currentTime = new DateTime();
$bankAccount1->setLastRefresh($currentTime);
$bankAccount2->setLastRefresh($currentTime);
$currentTime->setTimestamp($aTimestamp);
class BankAccount {
// ...
public function setLastRefresh(DateTimeImmutable $lastRefresh)
{
$this->lastRefresh = $lastRefresh;
} }
and the right way(not leaky abstraction):
class BankAccount
{
// ...
public function setLastRefresh(DateTime $lastRefresh)
{
$this->lastRefresh = clone $lastRefresh;
}
}
more description here.

Resources