Lucene NIOFSDirectory and SimpleFSDirectory with multiple threads - multithreading

My basic question is: what's the proper way to create/use instances of NIOFSDirectory and SimpleFSDirectory when there are multiple threads that need to run queries (reads) on the same index? More to the point: should an instance of the XXXFSDirectory be created for each thread that needs to run a query and retrieve some results (and then be closed immediately afterwards in the same thread), or should I make a "global" (singleton?) instance which is passed to all threads so they all use it at the same time (and it's no longer up to each thread to close it when it's done with a query)?
Here's more details:
I've read the docs on both NIOFSDirectory and SimpleFSDirectory and what I got is:
they both support multithreading:
NIOFSDirectory : "An FSDirectory implementation that uses java.nio's FileChannel's positional read, which allows multiple threads to read from the same file without synchronizing."
SimpleFSDirectory : "A straightforward implementation of FSDirectory using java.io.RandomAccessFile. However, this class has poor concurrent performance (multiple threads will bottleneck) as it synchronizes when multiple threads read from the same file. It's usually better to use NIOFSDirectory or MMapDirectory instead."
NIOFSDirectory is better suited (basically, faster) than SimpleFSDirectory in a multi-threaded context (see above)
NIOFSDirectory does not work well on Windows. On Windows, SimpleFSDirectory is recommended. On *nix OSes, however, NIOFSDirectory works fine, and due to its better performance under multiple threads it's recommended over SimpleFSDirectory.
"NOTE: NIOFSDirectory is not recommended on Windows because of a bug in how FileChannel.read is implemented in Sun's JRE. Inside of the implementation the position is apparently synchronized."
The reason I'm asking is that I've seen some actual projects, targeting Linux, where NIOFSDirectory is used to read from the index, but an instance of it is created for each request (from each thread); once the query is done and the results are returned, the thread closes that instance (only to create a new one at the next request, and so on). So I was wondering whether this is really a better approach than simply having a single NIOFSDirectory instance shared by all threads, opened when the application starts and closed much later when a certain (multi-threaded) job is finished...
More to the point, for a web application, isn't it better to have something like a context listener which creates an instance of NIOFSDirectory, places it into the Application Context, lets all Servlets share and use it, and then closes it when the app shuts down?

The official Lucene FAQ suggests the following:
Share a single IndexSearcher across queries and across threads in your application.
IndexSearcher requires a single IndexReader, and the latter can be produced with DirectoryReader.open(Directory), which only requires a single instance of Directory.
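For illustration, here is a minimal sketch of that shared-singleton approach, assuming a recent Lucene API where NIOFSDirectory takes a java.nio.file.Path (older versions take a File); the SearchContext class name and wiring are placeholders, not something from the FAQ. A context listener could create one instance at startup, put it into the ServletContext, and close it on shutdown.

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.NIOFSDirectory;

public final class SearchContext implements AutoCloseable {
    private final Directory directory;
    private final DirectoryReader reader;
    private final IndexSearcher searcher;

    public SearchContext(String indexPath) throws Exception {
        // One Directory for the whole application; NIOFSDirectory is fine on *nix.
        this.directory = new NIOFSDirectory(Paths.get(indexPath));
        this.reader = DirectoryReader.open(directory);
        // IndexSearcher is thread-safe: all request threads can share this instance.
        this.searcher = new IndexSearcher(reader);
    }

    public IndexSearcher searcher() {
        return searcher;
    }

    @Override
    public void close() throws Exception {
        reader.close();
        directory.close();
    }
}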

Related

Are there greenDAO thread safety best practices?

I'm having a go with greenDAO and so far it's going pretty well. One thing that doesn't seem to be covered by the docs or website (or anywhere :( ) is how it handles thread safety.
I know the basics mentioned elsewhere, like "use a single dao session" (general practice for Android + SQLite), and I understand the Java memory model quite well. The library internals even appear threadsafe, or at least built with that intention. But nothing I've seen covers this:
greenDAO caches entities by default. This is excellent for a completely single-threaded program - transparent and a massive performance boost for most uses. But if I e.g. loadAll() and then modify one of the elements, I'm modifying the same object globally across my app. If I'm using it on the main thread (e.g. for display), and updating the DB on a background thread (as is right and proper), there are obvious threading problems unless extra care is taken.
Does greenDAO do anything "under the hood" to protect against common application-level threading problems? For example, modifying a cached entity in the UI thread while saving it in a background thread (better hope they don't interleave! especially when modifying a list!)? Are there any "best practices" to protect against them, beyond general thread safety concerns (i.e. something that greenDAO expects and works well with)? Or is the whole cache fatally flawed from a multithreaded-application safety standpoint?
I've no experience with greenDAO but the documentation here:
http://greendao-orm.com/documentation/queries/
Says:
If you use queries in multiple threads, you must call forCurrentThread() on the query to get a Query instance for the current thread. Starting with greenDAO 1.3, object instances of Query are bound to their owning thread that build the query. This lets you safely set parameters on the Query object while other threads cannot interfere. If other threads try to set parameters on the query or execute the query bound to another thread, an exception will be thrown. Like this, you don’t need a synchronized statement. In fact you should avoid locking because this may lead to deadlocks if concurrent transactions use the same Query object.
To avoid those potential deadlocks completely, greenDAO 1.3 introduced the method forCurrentThread(). This will return a thread-local instance of the Query, which is safe to use in the current thread. Every time, forCurrentThread() is called, the parameters are set to the initial parameters at the time the query was built using its builder.
As far as I can see, the documentation doesn't explicitly say anything about multithreading beyond this, but it seems pretty clear that it is handled. The passage is about multiple threads using the same Query object, so clearly multiple threads can access the same database. It's certainly normal for databases and DAOs to handle concurrent access, and there are a lot of proven techniques for working with caches in this situation.
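As an illustration (not from the question), a sketch of the pattern those docs describe might look like the following; Note/NoteDao and the text property are placeholder names for a generated entity and its DAO.

import java.util.List;
// greenDAO 3.x package name; older versions use de.greenrobot.dao.query.Query
import org.greenrobot.greendao.query.Query;

// Build the Query once, e.g. when the screen is created.
Query<Note> baseQuery = noteDao.queryBuilder()
        .where(NoteDao.Properties.Text.eq("placeholder"))
        .build();

// Later, on any worker thread, take a thread-local copy before touching it.
Query<Note> query = baseQuery.forCurrentThread();
query.setParameter(0, "actual value");   // safe: this instance is bound to the current thread
List<Note> notes = query.list();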
By default GreenDAO caches and returns cached entity instances to improve performance. To prevent this behaviour, you need to call:
daoSession.clear()
to clear all cached instances. Alternatively you can call:
objectDao.detachAll()
to clear cached instances only for the specific DAO object.
You will need to call these methods every time you want to clear the cached instances, so if you want to disable all caching, I recommend calling them in your Session or DAO accessor methods.
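A sketch of what that might look like (daoSession, DaoSession and NoteDao stand in for your generated session and DAO classes):

// Session accessor: callers always start from an empty identity scope.
public DaoSession getSession() {
    daoSession.clear();        // drop all cached entity instances
    return daoSession;
}

// DAO accessor: only this DAO's cached instances are dropped.
public NoteDao getNoteDao() {
    NoteDao dao = daoSession.getNoteDao();
    dao.detachAll();
    return dao;
}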
Documentation:
http://greenrobot.org/greendao/documentation/sessions/#Clear_the_identity_scope
Discussion: https://github.com/greenrobot/greenDAO/issues/776

Sqlite thread modes and sqlite misuse paradox

I have a project where I should use multiple tables to avoid keeping duplicated data in my SQLite file (even though I knew using several tables would be a nightmare).
In my application I am reading data from one table in one method and inserting data into another table in another method. When I do this, I get error code 21 from the sqlite3_step function, which is SQLITE_MISUSE.
According to my research, that was because I was not able to reach the tables from multiple threads.
Up to now, I have read the SQLite website and learned that there are 3 modes in which an SQLite database can be configured:
1) single-thread: you have no chance of calling it from several threads.
2) multi-thread: yes, multi-threaded, but there are some obstacles.
3) serialized: this is the best match for multi-threaded database applications.
The check sqlite3_threadsafe() == 2 returned true, which (as I understood it) means my SQLite database is serialized, so I proved that for myself.
Then I added code to configure my SQLite database as serialized, to make sure of it:
sqlite3_config(SQLITE_CONFIG_SERIALIZED);
When I use the code above in a class where I read and insert data for one table, it works perfectly :). But if I try to use it in a class where I read and insert data across two tables (where I actually need it), the SQLITE_MISUSE problem comes up.
I checked the code where I open and close the database, and there is no problem there; each part works unless I delete the other.
I am using iOS 5 and this is really a big problem for my project. I heard that Instagram uses PostgreSQL; maybe this was the reason? Would you suggest PostgreSQL or SQLite in the first place?
It seems to me like you've got two things mixed up.
Single vs. multi-threaded
Single threaded builds are only ever safe to use from one thread of your code because they lack the mechanisms (mutexes, critical sections, etc.) internally that permit safe use from several. If you are using multiple threads, use a multi-threaded build (or expect “interesting” trouble; you have been warned).
SQLite's thread support is pretty simple. With a multi-threaded build, particular connections should only be used from a single thread (except that they can be initially opened in another).
All recent (last few years?) SQLite builds are happy with access to a single database from multiple processes, but the degree of parallelism depends on the…
Transaction type
SQL in general supports multiple types of transaction. SQLite supports only a subset of them, and its default is SERIALIZABLE. This is the safest mode of access; it simulates what you would see if only one thing could happen at a time. (Internally, it's implemented using a scheme that lets many readers in at once, but only one writer; there's some cleverness to prevent anyone from starving anyone else.)
SQLite also supports read-uncommitted transactions. This increases the amount of parallelism available to code, but at the risk of readers seeing information that's not yet been guaranteed to persist. Whether this matters to you depends on your application.

Does WinRT still have the same old UI threading restrictions?

In WinForms, pretty much all your UI is thread-specific. You have to use [STAThread] so that the common dialogs will work, and you can't (safely) access a UI element from any thread other than the one that created it. From what I've heard, that's because that's just how Windows works -- window handles are thread-specific.
In WPF, these same restrictions were kept, because ultimately it's still building on top of the same Windows API, still window handles (though mostly just for top-level windows), etc. In fact, WPF even made things more restrictive, because you can't even access things like bitmaps across threads.
Now along comes WinRT, a whole new way of accessing Windows -- a fresh, clean slate. Are we still stuck with the same old threading restrictions (specifically: only being able to manipulate a UI control from the thread that created it), or have they opened this up?
I would expect it to be the same model - but much easier to use, at least from C# and VB, with the new async handling which lets you write a synchronous-looking method which just uses "await" when it needs to wait for a long-running task to complete before proceeding.
Given the emphasis on making asynchronous code easier to write, it would be surprising for MS to forsake the efficiency of requiring single-threaded access to the UI at the same time.
The threading model is identical. There is still a notion of single-threaded and multi-threaded apartments (STA/MTA), which must be initialized by a call to RoInitialize; that function behaves very much like CoInitialize in name, arguments and error returns. The user interface thread is single threaded, confirmed at 36:00 in this video.
The HTML/CSS UI model is inherently single threaded (until the advent of web workers recently, JS didn't support threads). Xaml is also single threaded (because it's really hard for developers to write code to a multithreaded GUI).
The underlying threading model does have some key differences. When your application starts, an ASTA (Application STA) is created to run your UI code as I showed in the talk. This ASTA does not allow reentrancy - you will not receive unrelated calls while making an outgoing call. This is a significant difference from STAs.
You are allowed to create async workitems - see the Windows.System.Threadpool namespace. These workitem threads are automatically initialized to MTA. As Larry mentioned, webworkers are the JS equivalent concept.
Your UI components are thread affined. See the Windows.UI.Core.CoreDispatcher class for information on how to execute code on the UI thread. You can check out the threading sample for some example code to update the UI from an async operation.
Things are different in pretty important ways.
While it's true the underlying threading model is the same, your question is generally related to how logical concurrency works with UI, and with respect to this what developers see in Windows 8 will be new.
As you mention, most dialogs previously blocked. For Metro apps, many UI components do not block at all. Remember the talk of WinRT being asynchronous? It applies to UI components too.
For example this .NET 4 code will not necessarily kill your harddrive because the UI call blocks on Show (C# example):
bool formatHardDrive = true;
if (MessageBox.Show("Format your hard drive?", "Confirm",
                    MessageBoxButtons.YesNo) == DialogResult.No)
    formatHardDrive = false;
if (formatHardDrive)
    Format();
With Windows 8 Metro, many UI components like Windows.UI.Popups.MessageDialog are asynchronous by default, so the Show call immediately (logically) falls through to the next line of code before the user input is retrieved.
Of course there is an elegant solution to this based on the await/promise design patterns (Javascript example):
var md = new Windows.UI.Popups.MessageDialog("Hello World!");
md.showAsync().then(function (command) {
    console.log("pressed: " + command.label);
});
The point is that while the threading model doesn't change, when most people mention UI and threading they are thinking about logical concurrency and how it affects the programming model.
Overall I think the asynchronous paradigm shift is a positive thing. It requires a bit of a shift in perspective, but it's consistent with the way other platforms are evolving on both the client and server sides.

Thread-safety and concurrent modification of a table in SQLite3

Does thread-safety of SQLite3 mean different threads can modify the same table of a database concurrently?
No - SQLite does not support concurrent write access to the same database file. SQLite will simply block one of the transactions until the other one has finished.
Note that if you're using Python, to access an sqlite3 connection from different threads you need to disable the check_same_thread argument, e.g.:
sqlite3.connect(":memory:", check_same_thread=False)
As of the 24th of May 2010, the docs omit this option; the omission is listed as a bug here.
Not necessarily. If sqlite3 is compiled with the thread-safe macro (check via the int sqlite3_threadsafe(void) function), then you can try to access the same DB from multiple threads without the risk of corruption. Depending on the lock(s) required, however, you may or may not be able to actually modify data (I don't believe sqlite3 supports row locking, which means that to write you'll need to get a table lock). However, you can try; if one thread blocks, then it will automatically write as soon as the other thread finishes with the DB.
You can use SQLite in 3 different modes:
http://www.sqlite.org/threadsafe.html
If you decide on multi-thread mode or serialized mode, you can easily use SQLite in a multi-threaded application.
In those situations you can read from all your threads simultaneously anyway. If you need to write simultaneously, the opened table will be locked automatically for the current writing thread and unlocked afterwards (the next thread will wait (mutex) for its turn until the table is unlocked). In all those cases, you need to create a separate connection for every thread (.NET Data.Sqlite.dll). If you're using another implementation (e.g. an Android wrapper), things are sometimes different.

How can threads be avoided?

I've read a lot recently about how writing multi-threaded apps is a huge pain in the neck, and have learned enough about the topic to understand, at least at some level, why it is so.
I've read that using functional programming techniques can help alleviate some of this pain, but I've never seen a simple example of functional code that is concurrent. So, what are some alternatives to using threads? At least, what are some ways to abstract them away so you needn't think about things like locking and whether a particular library's objects are thread-safe.
I know Google's MapReduce is supposed to help with the problem, but I haven't seen a succinct explanation of it.
Although I'm giving a specific example below, I'm more curious about general techniques than about solving this specific problem (though using the example to help illustrate other techniques would be helpful).
I came to the question when I wrote a simple web crawler as a learning exercise. It works pretty well, but it is slow. Most of the bottleneck comes from downloading pages. It is currently single threaded, and thus only downloads a single page at a time. Thus, if the pages can be downloaded concurrently, it would speed things up dramatically, even if the crawler ran on a single processor machine. I looked into using threads to solve the issue, but they scare me. Any suggestions on how to add concurrency to this type of problem without unleashing a terrible threading nightmare?
The reason functional programming helps with concurrency is not because it avoids using threads.
Instead, functional programming preaches immutability, and the absence of side effects.
This means that an operation can be scaled out to N threads or processes without having to worry about messing with shared state.
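As a small Java illustration of that point (not from the original answer): because the mapping function below has no side effects and touches no shared state, switching from stream() to parallelStream() spreads the work over multiple threads without any locking.

import java.util.List;
import java.util.stream.Collectors;

public class ParallelMapSketch {
    // A pure function: same input always gives the same output, no shared state touched.
    static int wordCount(String doc) {
        return doc.trim().isEmpty() ? 0 : doc.trim().split("\\s+").length;
    }

    public static void main(String[] args) {
        List<String> documents = List.of("one two three", "four five", "six");
        // Side-effect-free work can be fanned out over N threads with no locks.
        List<Integer> counts = documents.parallelStream()
                .map(ParallelMapSketch::wordCount)
                .collect(Collectors.toList());
        System.out.println(counts);   // [3, 2, 1]
    }
}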
Actually, threads are pretty easy to handle until you need to synchronize them. Usually, you use a thread pool to add tasks and wait until they are finished.
It is when threads need to communicate and access shared data structures that multithreading becomes really complicated. As soon as you have two locks, you can get deadlocks, and this is where multithreading gets really hard. Sometimes your locking code can be wrong by just a few instructions. In that case, you might only see bugs in production, on multi-core machines (if you developed on a single core, as happened to me), or they might be triggered by some other hardware or software. Unit testing doesn't help much here: testing finds bugs, but you can never be as sure as with "normal" apps.
I'll add an example of how functional code can be used to safely make code concurrent.
Here is some code you might want to run in parallel, so you don't have to wait for one file to finish before starting to download the next:
void DownloadHTMLFiles(List<string> urls)
{
    foreach (string url in urls)
    {
        // Download the HTML and save it to a file named after the URL - perhaps used for caching.
        DownloadOneFile(url);
    }
}
If you have a number of files the user might spend a minute or more waiting for them all. We can re-write this code functionally like this, and it basically does the exact same thing:
urls.ForEach(DownloadOneFile);
Note that this still runs sequentially. However, not only is it shorter, we've also gained an important advantage. Since each call to the DownloadOneFile function is completely isolated from the others (for our purposes, available bandwidth isn't an issue), you could very easily swap out the ForEach function for another very similar one: one that kicks off each call to DownloadOneFile on a separate thread from a thread pool.
It turns out .NET has just such a function available in the Parallel Extensions. So, by using functional programming you can change one line of code and suddenly have something run in parallel that used to run sequentially. That's pretty powerful.
There are a couple of brief mentions of asynchronous models but no one has really explained it so I thought I'd chime in. The most common method I've seen used as an alternative for multi-threading is asynchronous architectures. All that really means is that instead of executing code sequentially in a single thread, you use a polling method to initiate some functions and then come back and check periodically until there's data available.
This really only works in models like your aforementioned crawler, where the real bottleneck is I/O rather than CPU. In broad strokes, the asynchronous approach would initiate the downloads on several sockets, and a polling loop periodically checks to see if they're finished downloading and when that's done, we can move on to the next step. This allows you to run several downloads that are waiting on the network, by context switching within the same thread, as it were.
The multi-threaded model would work much the same, except using a separate thread rather than a polling loop checking multiple sockets in the same thread. In an I/O bound application, asynchronous polling works almost as well as threading for many use cases, since the real problem is simply waiting for the I/O to complete and not so much the waiting for the CPU to process the data.
Another real world example is for a system that needed to execute a number of other executables and wait for results. This can be done in threads, but it's also considerably simpler and almost as effective to simply fire off several external applications as Process objects, then check back periodically until they're all finished executing. This puts the CPU-intensive parts (the running code in the external executables) in their own processes, but the data processing is all handled asynchronously.
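A rough Java sketch of that last idea, where ping is just a stand-in for whatever external executable you actually need to run:

import java.util.ArrayList;
import java.util.List;

public class PollProcesses {
    public static void main(String[] args) throws Exception {
        List<Process> running = new ArrayList<>();
        for (String host : List.of("example.com", "example.org")) {
            // Unix-style ping used purely as an example of an external process.
            running.add(new ProcessBuilder("ping", "-c", "1", host).start());
        }
        // Single-threaded polling loop: check back periodically until all have exited.
        while (!running.isEmpty()) {
            running.removeIf(p -> !p.isAlive());   // reap finished processes
            Thread.sleep(100);
        }
        System.out.println("all external processes finished");
    }
}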
The Python FTP server library I work on, pyftpdlib, uses the Python asyncore library to serve FTP clients with only a single thread, using asynchronous socket communication for file transfers and command/response handling.
For further reading, see the Python Twisted library's page on Asynchronous Programming; while somewhat specific to Twisted, it also introduces async programming from a beginner's perspective.
Concurrency is quite a complicated subject in computer science, one which demands a good understanding of hardware architecture as well as operating system behavior.
Multi-threading has many implementations based on your hardware and your hosting OS, and as tough as it already is, the pitfalls are numerous. It should be noted that in order to achieve "true" concurrency, threads are the only way to go. Basically, threads are the only way for you as a programmer to share resources between different parts of your software while allowing them to run in parallel. By "parallel", bear in mind that a standard CPU (dual/multi-core aside) can only do one thing at a time; concepts like context switching come into play, and they have their own set of rules and limitations.
I think you should seek more generic background on the subject, like you are saying, before you go about implementing concurrency in your program.
I guess the best place to start is the wikipedia article on concurrency, and go on from there.
What typically makes multi-threaded programming such a nightmare is when threads share resources and/or need to communicate with each other. In the case of downloading web pages, your threads would be working independently, so you may not have much trouble.
One thing you may want to consider is spawning multiple processes rather than multiple threads. In the case you mention--downloading web pages concurrently--you could split the workload up into multiple chunks and hand each chunk off to a separate instance of a tool (like cURL) to do the work.
If your goal is to achieve concurrency it will be hard to get away from using multiple threads or processes. The trick is not to avoid it but rather to manage it in a way that is reliable and non-error prone. Deadlocks and race conditions in particular are two aspects of concurrent programming that are easy to get wrong. One general approach to manage this is to use a producer/consumer queue... threads write work items to the queue and workers pull items from it. You must make sure you properly synchronize access to the queue and you're set.
Also, depending on your problem, you may also be able to create a domain specific language which does away with concurrency issues, at least from the perspective of the person using your language... of course the engine which processes the language still needs to handle concurrency, but if this will be leveraged across many users it could be of value.
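A minimal Java sketch of such a producer/consumer queue; the BlockingQueue does the synchronization, so the producer and the worker never touch shared state directly:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> workQueue = new ArrayBlockingQueue<>(100);

        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String url = workQueue.take();    // blocks until an item arrives
                    if (url.equals("STOP")) break;    // simple shutdown signal
                    System.out.println("downloading " + url);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        // Producer side: just hand work items to the queue.
        workQueue.put("http://example.com/a");
        workQueue.put("http://example.com/b");
        workQueue.put("STOP");
        worker.join();
    }
}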
There are some good libraries out there.
java.util.concurrent.ExecutorCompletionService will take a collection of Futures (i.e. tasks which return values), process them in background threads, then bung them in a Queue for you to process further as they complete. Of course, this is Java 5 and later, so isn't available everywhere.
In other words, all your code is single threaded - but where you can identify stuff safe to run in parallel, you can farm it off to a suitable library.
Point is, if you can make the tasks independent, then thread safety isn't impossible to achieve with a little thought - though it is strongly recommended you leave the complicated bit (like implementing the ExecutorCompletionService) to an expert...
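For example, a sketch of that ExecutorCompletionService pattern (the "fetched" string stands in for real download work):

import java.util.List;
import java.util.concurrent.*;

public class CompletionServiceSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<String> completion = new ExecutorCompletionService<>(pool);

        List<String> urls = List.of("http://example.com/a", "http://example.com/b");
        for (String url : urls) {
            completion.submit(() -> "fetched " + url);   // Callable run on a background thread
        }

        // The main code stays single threaded: it just takes results as they complete.
        for (int i = 0; i < urls.size(); i++) {
            Future<String> done = completion.take();
            System.out.println(done.get());
        }
        pool.shutdown();
    }
}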
One simple way to avoid threading in your simple scenario is to download from different processes. The main process invokes other processes with parameters telling them to download the files to a local directory, and then the main process can do the real job.
I don't think there is any simple solution to these problems. It's not a threading problem; it's concurrency that breaks the human mind.
You might watch the MSDN video on the F# language: PDC 2008: An introduction to F#
This includes the two things you are looking for. (Functional + Asynchronous)
For python, this looks like an interesting approach: http://members.verizon.net/olsongt/stackless/why_stackless.html#introduction
Use Twisted. "Twisted is an event-driven networking engine written in Python" http://twistedmatrix.com/trac/. With it, I could make 100 asynchronous http requests at a time without using threads.
Your specific example is seldom solved with multi-threading. As many have said, this class of problem is IO-bound, meaning the processor has very little work to do: it spends most of its time waiting for data to arrive over the wire and processing it, and similarly it has to wait for disk buffers to flush so that it can put more of the recently downloaded data on disk.
The route to performance is through the select() facility, or an equivalent system call. The basic process is to open a number of sockets (for the web crawler downloads) and file handles (for storing them to disk). Next you set all of the sockets and file handles to non-blocking mode, meaning that instead of making your program wait until data is available to read after issuing a request, the call returns right away with a special code (usually EAGAIN) to indicate that no data is ready. If you looped through all of the sockets in this way you would be polling, which works, but it is still a waste of CPU resources because your reads and writes will almost always return EAGAIN.
To get around this, all of the sockets and file handles are collected into an fd_set, which is passed to the select() system call; your program then blocks, waiting on ANY of the sockets, and select() wakes your program up when there is data on any of the streams to process.
The other common case, compute-bound work, is without a doubt best addressed with some sort of true parallelism (as opposed to the asynchronous concurrency presented above) to access the resources of multiple CPUs. If your CPU-bound task is running on a single-threaded architecture, definitely avoid any concurrency, as the overhead will actually slow your task down.
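If you happen to be on the JVM, the same select()-style approach is exposed through java.nio; a rough, simplified sketch (error handling and partial writes omitted) might look like this:

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SelectSketch {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        for (String host : new String[] {"example.com", "example.org"}) {
            SocketChannel ch = SocketChannel.open();
            ch.configureBlocking(false);                  // non-blocking: calls return immediately
            ch.connect(new InetSocketAddress(host, 80));  // completion is reported via the selector
            ch.register(selector, SelectionKey.OP_CONNECT);
        }
        ByteBuffer buf = ByteBuffer.allocate(8192);
        while (!selector.keys().isEmpty()) {
            selector.select(200);                         // wait (briefly) until something is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                SocketChannel ch = (SocketChannel) key.channel();
                if (key.isConnectable() && ch.finishConnect()) {
                    key.interestOps(SelectionKey.OP_READ); // connected: now watch for data
                    ch.write(ByteBuffer.wrap("GET / HTTP/1.0\r\n\r\n".getBytes()));
                } else if (key.isReadable()) {
                    buf.clear();
                    if (ch.read(buf) == -1) {
                        ch.close();                        // server closed: key is cancelled
                    }
                }
            }
        }
    }
}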
Threads are not to be avoided nor are they "difficult". Functional programming is not necessarily the answer either. The .NET framework makes threading fairly simple. With a little thought you can make reasonable multithreaded programs.
Here's a sample of your webcrawler (in VB.NET)
Imports System.Threading
Imports System.Net

Module modCrawler

    Class URLtoDest
        Public strURL As String
        Public strDest As String

        Public Sub New(ByVal _strURL As String, ByVal _strDest As String)
            strURL = _strURL
            strDest = _strDest
        End Sub
    End Class

    Class URLDownloader
        Public id As Integer
        Public url As URLtoDest

        Public Sub New(ByVal _url As URLtoDest)
            url = _url
        End Sub

        Public Sub Download()
            Using wc As New WebClient()
                wc.DownloadFile(url.strURL, url.strDest)
                Console.WriteLine("Thread Finished - " & id)
            End Using
        End Sub
    End Class

    Public Sub Download(ByVal ud As URLtoDest)
        Dim dldr As New URLDownloader(ud)
        Dim thrd As New Thread(AddressOf dldr.Download)
        dldr.id = thrd.ManagedThreadId
        thrd.SetApartmentState(ApartmentState.STA)
        thrd.IsBackground = False
        Console.WriteLine("Starting Thread - " & thrd.ManagedThreadId)
        thrd.Start()
    End Sub

    Sub Main()
        Dim lstUD As New List(Of URLtoDest)
        lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file0.txt"))
        lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file1.txt"))
        lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file2.txt"))
        lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file3.txt"))
        lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file4.txt"))
        lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file5.txt"))
        lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file6.txt"))
        lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file7.txt"))
        lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file8.txt"))
        lstUD.Add(New URLtoDest("http://stackoverflow.com/questions/382478/how-can-threads-be-avoided", "c:\file9.txt"))

        For Each ud As URLtoDest In lstUD
            Download(ud)
        Next

        ' you will see this message in the middle of the text
        ' pressing a key before all files are done downloading aborts the threads that aren't finished
        Console.WriteLine("Press any key to exit...")
        Console.ReadKey()
    End Sub

End Module
