Multi threaded Lucene Index Writer with Spring Batch

Multi threaded Lucene Index Writer with Spring Batch - multithreading

I have a multi step Spring Batch job and in one of steps I create Lucene indices for the data read in reader so subsequent steps can search in that Lucene index.
Based on read data in ItemReader, I spread indices to few separate directories.
If I specify, Step Task Executor to be a SimpleAsyncTaskExecutor , I don't get any issue as long as indices are always written to different directories but sometimes I get a locking exception. I guess, two threads tried to write to same Index.
If I remove SimpleAsyncTaskExecutor, I don't get any issues but write becomes sequential and slow.
Is it possible to use multi threading for a Lucene Index writer if indices are being written to a single directory?
Do I need to make index creator code to be thread safe to use SimpleAsyncTaskExecutor?
index creator code is in step processor.

I am using Lucene 6.0.0 and as per IndexWriter API Doc,
NOTE: IndexWriter instances are completely thread safe, meaning
multiple threads can call any of its methods, concurrently. If your
application requires external synchronization, you should not
synchronize on the IndexWriter instance as this may cause deadlock;
use your own (non-Lucene) objects instead.
I was creating multiple instances of writer and that was causing problems. Single writer instance can be passed to as many threads as you like provided rest of the code around that writer is thread safe.
I used a single writer instance and parallelized chunks. Each parallel chunk wrote to same directory without any issues.
To parallelize chunks, I had to made my chunk components - reader , processor and writer to be thread safe.

Related

How can I initialize a Node.js workerpool with common data?

I have an Node.js AWS Lambda function which processes an array. The number of entries in the array can vary from just a few to as many as several hundred million. Processing each iteration takes milliseconds and makes use of shared data (i.e., each iteration uses the exact same input data except for a key). The shared data is non-trivial in size. The results of some iterations are collected into an array of results. The results of any one iteration do not have an impact on any other iteration (i.e., each is atomic).
To improve performance, I want to make this process multi-threaded. Since there will be very many jobs and each will do the exact same task, a Workerpool seemed right. I created one with n-1 (n = # processes) workers. It works, but the processing takes longer in my pool version than in my non-pool version. Research leads me to suspect the issue is the size of the shared data passed to during each invocation.
Ideally, I'd like to initialize each worker with the same shared input and just pass the relevant key with each invocation. However, I don't see an option to pass shared data when initializing the pool (https://github.com/josdejong/workerpool).
I found this 2014 SO question (Share objects in nodejs between different instances) that has a comment indicating that it isn't possible in Node.js ("Thanks for the great stuff but i was looking for simple solution like the threads in .net and java which support sharing objects between different threads").
Does Node.js support initializing workers in a workerpool with shared data? I'm new to multi-threading and would appreciate any guidance/links.

Inter-thread communication in Swift?

My goal is to parse a large XML file (20 GB) with Swift. There are some performance issues with NSXMLParser and bridging to Swift objects, so I'm looking at multi-threading. Specifically the following division:
Main thread - parses data
Worker thread - casts ObjC types into Swift types and sends to 1. The casting of ObjC NSDictionary to [String: String] is the largest bottleneck. This is also the main reason for separating onto multiple threads.
Worker thread - parses XML into ObjC types - and sends to 2. NSXMLParser is a push-parser, once it starts parsing, you cannot pause it.
The data should be parsed sequentially, so the input ordering should be maintained. My idea is to run an NSRunLoop on both 1 and 2, allowing parallel processing without blocking. According to Apple's documentation, communication between the threads can be achieved by calling performSelector:onThread:withObject:waitUntilDone:. However this symbol is not available in Swift.
I don't think that GCD would fit as a solution. Both worker threads should be long-running processes with new work coming in at random intervals.
How can one achieve the above (e.g. NSRunLoops on multiple threads) using Swift?

I used NSOperation for the first time last month, and it's is a really easy object to subclass, you could either chain them together with completion blocks, or you can set operations to be dependencies of each other so that they're performed sequentially.
It's also pretty easy to communicate with NSOperations by passing in objects to them.
NSHipster: http://nshipster.com/nsoperation/

What is the general design ideas of read-compute-write thread-safe program based on it's single-threaded version?

Consider that the sequental version of the program already exists and implements a sequence of "read-compute-write" operations on a single input file and other single output file. "Read" and "write" operations are performed by the 3rd-party library functions which are hard (but possible) to modify, while the "compute" function is performed by the program itself. Read-write library functions seems to be not thread-safe, since they operate with internal flags and internal memory buffers.
It was discovered that the program is CPU-bounded, and it is planned to improve the program by taking advantage of multiple CPUs (up to 80) by designing the multi-processor version of the program and using OpenMP for that purpose. The idea is to instantiate multiple "compute" functions with same single input and single output.
It is obvious that something nedds to be done in insuring the consistent access to reads, data transfers, computations and data storages. Possible solutions are: (hard) rewrite the IO library functions in thread-safe manner, (moderate) write a thread-safe wrapper for IO functions that would also serve as a data cacher.
Is there any general patterns that cover the subject of converting, wrapping or rewriting the single-threaded code to comply with OpenMP thread-safety assumptions?
EDIT1: The program is fresh enough for changes to make it multi-threaded (or, generally a parallel one, implemented either by multi-threading, multi-processing or other ways).

As a quick response, if you are processing a single file and writing to another, with openMP its easy to convert the sequential version of the program to a multi-thread version without taking too much care about the IO part, provided that the compute algorithm itself can be parallelized.
This is true because usually the main thread, takes care of the IO. If this cannot be achieved because the chunks of data are too big to read at once, and the compute algorithm cannot process smaller chunks, you can use the openMP API to synchronize the IO in each thread. This does not mean that the whole application will stop or wait until the other threads finish computing so new data can be read or written, it means that only the read and write parts need to be done atomically.
For example, if the flow of your sequencial application is as follows:
1) Read
2) compute
3) Write
Given that it truly can be parallelized, and each chunk of data needs to be read from within each thread, each thread could follow the next design:
1) Synchronized read of chunk from input (only one thread at the time could execute this section)
2) Compute chunk of data (done in parallel)
3) Synchronized write of computed chunk to output (only one thread at the time could execute this section)
if you need to write the chunks in the same order you have read them, you need to buffer first, or adopt a different strategy like fseek to the correct position, but that really depends if the output file size is known from the start, ...
Take special attention to the openMP scheduling strategy, because the default may not be the best to your compute algorithm. And if you need to share results between threads, like the offset of the input file you have read, you may use reduction operations provided by the openMP API, which is way more efficient than making a single part of your code run atomically between all threads, just to update a global variable, openMP knows when its safe to write.
EDIT:
In regards of the "read, process, write" operation, as long as you keep each read and write atomic between every worker, I can't think any reason you'll find any trouble. Even when the data read is being stored in a internal buffer, having every worker accessing it atomically, that data is acquired in the exact same order. You only need to keep special attention when saving that chunk to the output file, because you don't know the order each worker will finish processing its attributed chunk, so, you could have a chunk ready to be saved that was read after others that are still being processed. You just need each worker to keep track of the position of each chunk and you can keep a list of pointers to chunks that need to be saved, until you have a sequence of finished chunks since the last one saved to the output file. Some additional care may need to be taken here.
If you are worried about the internal buffer itself (and keeping in mind I don't know the library you are talking about, so I can be wrong) if you make a request to some chunk of data, that internal buffer should only be modified after you requested that data and before the data is returned to you; and as you made that request atomically (meaning that every other worker will need to keep in line for its turn) when the next worker asks for his piece of data, that internal buffer should be in the same state as when the last worker received its chunk. Even in the case that the library particularly says it returns a pointer to a position of the internal buffer and not a copy of the chunk itself, you can make a copy to the worker's memory before releasing the lock on the whole atomic read operation.
If the pattern I suggested is followed correctly, I really don't think you would find any problem you wouldn't find in the same sequential version of the algorithm.

with a little of synchronisation you can go even further. Consider something like this:
#pragma omp parallel sections num_threads
{
#pragma omp section
{
input();
notify_read_complete();
}
#pragma omp section
{
wait_read_complete();
#pragma omp parallel num_threads(N)
{
do_compute_with_threads();
}
notify_compute_complete();
}
#pragma omp section
{
wait_compute_complete();
output();
}
}
So, the basic idea would be that input() and output() read/write chunks of data. The compute part then would work on a chunk of data while the other threads are reading/writing. It will take a bit of manual synchronization work in notify*() and wait*(), but that's not magic.
Cheers,
-michael

Lucene NIOFSDirectory and SimpleFSDirectory with multiple threads

My basic question is: what's the proper way to create/use instances of NIOFSDirectory and SimpleFSDirectory when there's multiple threads that need to make queries (reads) on the same index. More to the point: should an instance of the XXXFSDirectory be created for each thread that needs to do a query and retrieve some results (and then in the same thread have it closed immediatelly after), or should I make a "global" (singleton?) instance which is passed to all threads and then they all use it at the same time (and it's no longer up to each thread to close it when it's done with a query)?
Here's more details:
I've read the docs on both NIOFSDirectory and SimpleFSDirectory and what I got is:
they both support multithreading:
NIOFSDirectory : "An FSDirectory implementation that uses java.nio's FileChannel's positional read, which allows multiple threads to read from the same file without synchronizing."
SimpleFSDirectory : "A straightforward implementation of FSDirectory using java.io.RandomAccessFile. However, this class has poor concurrent performance (multiple threads will bottleneck) as it synchronizes when multiple threads read from the same file. It's usually better to use NIOFSDirectory or MMapDirectory instead."
NIOFSDirectory is better suited (basically, faster) than SimpleFSDirectory in a multi threaded context (see above)
NIOFSDIrectory does not work well on Windows. On Windows SimpleFSDirectory is recomended. However on *nix OS NIOFSDIrectory works fine, and due to better performance when multi threading, it's recommended over SimpleFSDirectory.
"NOTE: NIOFSDirectory is not recommended on Windows because of a bug in how FileChannel.read is implemented in Sun's JRE. Inside of the implementation the position is apparently synchronized."
The reason I'm asking this is that I've seen some actual projects, where the target OS is Linux, NIOFSDirectory is used to read from the index, but an instance of it is created for each request (from each thread), and once the query is done and the results returned, the thread closes that instance (only to create a new one at the next request, etc). So I was wondering if this is really a better approach than to simply have a single NIOFSDirectory instance shared by all threads, and simply have it opened when the application starts, and closed much later on when a certain (multi threaded) job is finished...
More to the point, for a web application, isn't it better to have something like a context listener which creates an instance of NIOFSDirectory , places it in to the Application Context, all Servlets share and use it, and then the same context listener closes it when the app shuts down?

Official Lucene FAQ suggests the following:
Share a single IndexSearcher across queries and across threads in your
application.
IndexSearcher requires single IndexReader and the latter can be produced with a DirectoryReader.open(Directory) which would only require a single instance of Directory.

concurrent saving from two different threads to Core Data persistant store with unique entity Id

I'm implementing multithreaded core data downloader.
I have a problem with doubling objects while saving objects with unique string attribute in Entity.
If 2 threads are downloading from the same url simultaneously (f.e., updater-timer fires and application enters foreground - so user calls update method), I cant check existanse of object with unique attribute value in persistant store, so objects are doubling.
How can I avoid doubling objects and what is the best solution in terms of performance?
description: (sorry, I cant post images yet)
http://i.stack.imgur.com/yMBgQ.png

Another approach would be to perform the download/save within an NSOperation, and prior to adding an operation to the queue, you could check to see if there was an existing operation to download that URL in the NSOperationQueue.
The advantage of this approach is that you don't download any more data than is necessary.

I've run into this before and it's a tricky problem.
I solved it by performing by downloads in separate background threads (the same as you are doing now) but all code data write operations happen on a global NSOperation queue with numConcurrentOperations set to 1. When each background download was complete it created an NSOperation and put it onto that queue.
Good: Very simple thread safety - the NSOperationQueue ensured that only one thread was writing to CoreData at any one point.
Bad: Slight hit in terms of performance because the Core Data operations were working in series, not in parallel. This can be mitigated by doing any calculations needed on the data in the download background thread and doing as little as possible in the Core Data operation.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string