Running multiple ANTLR4 lexer/parser instances in parallel - multithreading

I am using an ANTLR v4-generated parser to process a large number of files on a machine with multiple cores. To gain some extra speed, I would like to process the files in parallel.
To check whether parser performance is CPU-bound, I split the files into groups and parsed them using independent processes, each running the same parser in a dedicated JVM instance. This increased performance drastically.
This encouraged me to try the same with multiple threads instead of processes, but without success. I created two worker threads, each with its own parser, lexer, and file-stream instance. The results returned are correct; however, using two threads actually takes slightly longer than using one.
To ensure that I am using threads correctly and that there is no problem with the JVM installation, I temporarily replaced the parsing code with code that calculates Fibonacci sequences: in that case, using multiple threads led to a performance increase.
Analyzing this behavior, I found that when using multiple parsing threads, none of the CPUs reaches high usage. It looks like the threads are fighting over some shared resource.
Taking a look at the ANTLR source code, I found the following comment in ParserATNSimulator.java:
"All instances of the same parser share the same decision DFAs through a static field. Each instance gets its own ATN simulator but they share the same decisionToDFA field. They also share a PredictionContextCache object that makes sure that all PredictionContext objects are shared among the DFA states. This makes a big size difference."
I am wondering whether synchronized access to these shared resources is causing the performance problem. If so, is it possible to create unique instances of these resources instead? Or is there perhaps an even simpler solution to the problem?
Thanks in advance!
Fabian

The reference version of the ANTLR 4 runtime is designed to be safe for use from multiple parser threads, as long as each thread uses its own parser instance. I maintain an alternate (unofficial) branch of ANTLR 4 which implements the core algorithms in a different way to improve performance in multithreaded scenarios.
This branch exposes a slightly different API in some areas, so it's not a drop-in replacement for the 4.0 release of ANTLR 4.
https://github.com/sharwell/antlr4/tree/optimized
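To make the "one parser instance per thread" point concrete, here is a minimal sketch of handing the files to a thread pool, with each task building its own lexer and parser. The generated classes MyLexer/MyParser, the start rule file(), and the input paths are placeholders for your own grammar, and CharStreams.fromPath assumes a reasonably recent ANTLR 4 runtime.

    import org.antlr.v4.runtime.CharStreams;
    import org.antlr.v4.runtime.CommonTokenStream;

    import java.nio.file.Path;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelParse {
        public static void main(String[] args) throws Exception {
            // Placeholder input list; in practice this would be the full file set.
            List<Path> files = List.of(Path.of("a.dat"), Path.of("b.dat"));

            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            for (Path file : files) {
                pool.submit(() -> {
                    try {
                        // Each task creates its own lexer and parser; nothing
                        // mutable is shared between the worker threads.
                        MyLexer lexer = new MyLexer(CharStreams.fromPath(file));
                        MyParser parser = new MyParser(new CommonTokenStream(lexer));
                        parser.file(); // hypothetical start rule
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
        }
    }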

Related

Multithreading vs Shared Memory

I have a problem which is essentially a series of searches for multiple copies of items (needles) in a massive but in-memory database (tens of GB) - the haystack. This is divided into tasks, where each task is to find each of a series of needles in the haystack, and each task is logically independent from the other tasks. (This is already distributed across multiple machines, where each machine has its own copy of the haystack.)
There are many ways this could be parallelized on individual machines.
We could have one search process per CPU core sharing memory.
Or we could have one search process with multiple threads (one per core). Or even several multi-threaded processes.
Three possible architectures:
1. A process loads the haystack into POSIX shared memory; subsequent processes use the shared memory segment instead (like a cache).
2. A process loads the haystack into memory and then forks; each process uses the same memory because of copy-on-write semantics.
3. A process loads the haystack into memory and spawns multiple search threads.
The question is: is one method likely to be better than the others, and why? Or rather, what are the trade-offs?
(For argument's sake, assume performance trumps implementation complexity.)
Implementing two or three and measuring is possible, of course, but it is hard work.
Are there any reasons why one might be definitively better?
Data in the haystack is immutable.
The processes are running on Linux. So processes are not significantly more expensive than threads.
The haystack spans many GBs so CPU caches are not likely to help.
The search process is essentially a binary search (actually equal_range with a touch of interpolation).
Because the tasks are logically independent, there is no benefit from inter-thread communication being cheaper than inter-process communication (as per, for example, https://stackoverflow.com/a/18114475/1569204).
I cannot think of any obvious performance trade-offs between threads and shared memory here. Are there any? Perhaps the code maintenance trade-offs are more relevant?
Background research
The only relevant SO answer I could find refers to the overhead of synchronising threads - Linux: Processes and Threads in a Multi-core CPU - which is true but less applicable here.
Related and interesting but different questions are:
Multithreading: What is the point of more threads than cores?
Performance difference between IPC shared memory and threads memory
performance - multithreaded or multiprocess applications
An interesting presentation is https://elinux.org/images/1/1c/Ben-Yossef-GoodBadUgly.pdf
It suggests there can be a small difference in the speed of thread vs process context switches.
I am assuming that, except for a monitoring thread/process, the others are never switched out.
General advice: be able to measure improvements! Without that, you may tweak all you like based on advice from the internet and still not get optimal performance. Effectively, I'm telling you not to trust me or anyone else (including yourself), but to measure. Also prepare yourself for measuring this in real time on production systems. A benchmark may help you to some extent, but real load patterns are still a different beast.
Then, you say the operations are purely in-memory, so the speed doesn't depend on (network or storage) IO performance. The two bottlenecks you face are CPU and RAM bandwidth. So, in order to work on the right part, find out which is the limiting factor. Making sure that the corresponding part is efficient ensures optimal performance for your searches.
Further, you say that you do binary searches. This basically means you do log(n) comparisons, where each comparison requires a load of a certain element from the haystack. This load probably goes through all caches, because the size of the data makes cache hits very unlikely. However, you could hold multiple needles to search for in cache at the same time. If you then manage to trigger the cache loads for the needles first and then perform the comparisons, you could reduce the time where either the CPU or the RAM is idle waiting for new operations to perform. This is obviously (like the others) a parameter you need to tweak for the system it runs on.
Even further, reconsider binary searching. Binary search performs reliably, with a good upper bound, on random data. If you have any patterns (i.e. anything non-random) in your data, try to exploit this knowledge. If you can roughly estimate the location of the needle you're searching for, you may thus reduce the number of lookups. This is basically moving the work from the RAM bus to the CPU, so it again depends which is the actual bottleneck. Note that you can also switch algorithms, e.g. going from an educated guess to a binary search when you have fewer than a certain number of elements left to consider.
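To illustrate the "educated guess first, binary search once few elements remain" idea, here is a minimal, hedged sketch assuming the haystack is a sorted long[] with roughly uniformly distributed values; the threshold of 64 elements is an arbitrary tuning parameter.

    import java.util.Arrays;

    public class HybridSearch {
        // Interpolation-guided probing that falls back to a plain binary
        // search once the remaining window is small.
        static int search(long[] haystack, long needle) {
            int lo = 0, hi = haystack.length - 1;
            while (hi - lo > 64 && haystack[lo] != haystack[hi]) {
                // Educated guess: estimate the position assuming a roughly
                // uniform distribution of values between lo and hi.
                double fraction = (double) (needle - haystack[lo])
                        / (haystack[hi] - haystack[lo]);
                int guess = lo + (int) (fraction * (hi - lo));
                guess = Math.max(lo, Math.min(hi, guess));
                if (haystack[guess] < needle) {
                    lo = guess + 1;
                } else if (haystack[guess] > needle) {
                    hi = guess - 1;
                } else {
                    return guess;
                }
            }
            // Few elements left (or no value spread): finish with binary search.
            int idx = Arrays.binarySearch(haystack, lo, hi + 1, needle);
            return idx >= 0 ? idx : -1;
        }

        public static void main(String[] args) {
            long[] haystack = {2, 5, 7, 11, 13, 17, 19, 23};
            System.out.println(search(haystack, 13)); // prints 4
        }
    }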
Lastly, you say that every node has a full copy of your database. If each of the N nodes is assigned one Nth of the database, it could improve caching. You'd then make one first step at locating the element to determine the node and then dispatch the search to the responsible node. If in doubt, every node can still process the search as a fallback.
The modern approach is to use threads and a single process.
Whether that is better than using multiple processes and a shared memory segment might depend somewhat on your personal preference and how easy threads are to use in the language you are using, but I would say that if decent thread support is available (e.g. Java) you are pretty much always better off using it.
The main advantage of using multiple processes, as far as I can see, is that it is impossible to run into the kind of issues you get when managing multiple threads (e.g., forgetting to synchronise access to shared writable resources - except for the shared memory pool). However, thread-safety by not having threads at all is not much of an argument in its favour.
It might also be slightly easier to add processes than to add threads: with threads, you would have to write some code to change the number of processing threads at runtime (or use a framework or application server).
But overall, the multiple-process approach is dead. I haven't used shared memory in decades. Threads have won the day and it is worth the investment to learn to use them.
If you do need to have multi-threaded access to common writable memory then languages like Java give you all sorts of classes for doing that (as well as language primitives). At some point you are going to find you want that and then with the multi-process approach you are faced with synchronising using semaphores and writing your own classes or maybe looking for a third party library, but the Java people will be miles ahead by then.
You also mentioned forking and relying on copy-on-write. This seems like a very fragile solution dependent on particular behaviour of the system and I would not myself use it.
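As a concrete illustration of the single-process, multi-threaded layout recommended above, here is a minimal sketch in Java (an assumed language, since the question does not name one): the haystack is loaded once into an immutable array, and a fixed pool of worker threads runs independent needle batches against it with no synchronisation beyond the pool itself. The tiny data set and batch contents are placeholders.

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class HaystackSearch {
        public static void main(String[] args) throws InterruptedException {
            // Load once; the array is never written afterwards, so all worker
            // threads can read it concurrently without locking.
            final long[] haystack = loadHaystack();
            List<long[]> needleBatches = List.of(
                    new long[]{3, 17, 42},   // placeholder tasks
                    new long[]{8, 23, 99});

            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            for (long[] batch : needleBatches) {
                pool.submit(() -> {
                    for (long needle : batch) {
                        int idx = Arrays.binarySearch(haystack, needle);
                        System.out.println(needle + " -> " + (idx >= 0 ? idx : -1));
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        private static long[] loadHaystack() {
            // Placeholder: in reality this loads the multi-GB data set once at startup.
            long[] data = {1, 3, 8, 17, 23, 42, 99, 120};
            Arrays.sort(data);
            return data;
        }
    }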

Multithreading in AWS

I am a new user of AWS EC2 and I am going to deploy a mostly IO-bound application on a Linux-based EC2 m4.large instance. As far as I can read on the AWS instances sheet, available here, I have 2 vCPUs, which means I have two hyperthreads running on 1 physical CPU. Therefore, my question and my doubts concern multithreaded processing. As I understand it, the maximum number of threads I would be able to use should be 2, but I was wondering whether there are any guidelines about multithreaded computing on AWS instances. Basically, my application reads a big file (1.5+ GB) and then needs to process its chunks. I was thinking of implementing either a producer-consumer pattern (one thread reading and one processing) or a map-like approach (every thread opens the file and seeks to its own partition). I know that these two approaches may have different complexities, but I am interested in performance, so I need to squeeze out as much speed as possible! Thank you in advance.
If your application is IO bound, using multithreaded processing is likely to be of limited utility, since multithreading is primarily useful for optimizing computation rather than IO. However, if you really want to get every last bit of speed, your best bet is to program it both ways and see which works better under your particular circumstances.
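If the producer-consumer variant mentioned in the question is chosen anyway, a bounded queue keeps the reader and the processor overlapped without letting the reader run too far ahead. Below is a minimal sketch in Java (an assumed language; the question does not name one), reading line by line for simplicity, with the file name and the per-chunk processing as placeholders.

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ReaderProcessor {
        // Sentinel object signalling end of input; compared by reference.
        private static final String EOF = new String("<eof>");

        public static void main(String[] args) throws Exception {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

            Thread producer = new Thread(() -> {
                try (BufferedReader in = Files.newBufferedReader(Path.of("big-file.txt"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        queue.put(line);   // blocks when the processor falls behind
                    }
                    queue.put(EOF);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });

            Thread consumer = new Thread(() -> {
                try {
                    String chunk;
                    while ((chunk = queue.take()) != EOF) {
                        process(chunk);    // CPU work overlaps with the next read
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            producer.start();
            consumer.start();
            producer.join();
            consumer.join();
        }

        private static void process(String chunk) {
            // Placeholder for the real per-chunk processing.
        }
    }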

Java 7 forking and joining

I have a main thread from which I want to spawn two threads to parse two different XML files. I want to know whether Java 7's fork/join framework should be used in this scenario, or whether the traditional approach we used in JDK 1.4 is enough for this case.
The Fork/Join Framework is great if you have a potential tree of tasks and the size of this tree is unknown. Merge sort is a good example here. Having only two files to parse, however, you won't be able to utilize the key features of the FJF:
Work stealing - dynamic balancing of task queues among worker threads
Easy scheduling of new tasks spawned by existing ones
You may, of course, implement it using the FJF to play with the nice new classes, and it will do the trick. But you're unlikely to get any performance or maintainability benefits out of it, so my recommendation would be to follow a traditional approach here.
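A minimal sketch of that traditional approach: hand the two parse jobs to a small executor (or two plain threads) and wait for both results. The file names are placeholders, and each task builds its own DocumentBuilder, since that class is not thread-safe.

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    import java.io.File;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class TwoXmlParse {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(2);

            // Each task creates its own DocumentBuilder, so nothing is shared.
            Future<Document> first  = pool.submit(() -> parse("first.xml"));
            Future<Document> second = pool.submit(() -> parse("second.xml"));

            Document a = first.get();   // blocks until the first parse is done
            Document b = second.get();
            pool.shutdown();
            // ... continue on the main thread with both documents ...
        }

        private static Document parse(String path) throws Exception {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File(path));
        }
    }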

Threading run time without adding extra lines in program

Is there any threading library that can parse through code, find blocks of code that can be threaded, and automatically add the required threading instructions?
Also, I want to compare the performance of a multithreaded program against its single-threaded version. For this I would need to monitor the CPU usage (how much each processor is being used). Is there any tool available to do this?
I'd say the decision whether or not a given block of code can be rewritten to be multi-threaded is way too hard for an automated process to make. To make matters worse, multi-threaded code typically accesses resources outside its own scope, such as pulling data over the network, loading large files, waiting for events, executing database queries, etc.; without detailed information about all these external factors, it is impossible to decide where to go multithreaded, simply because not all the required information is in the code.
Also, a lot of code that is multi-threadable in theory will not run faster if multi-threaded, but will in fact slow down.
Some compilers (such as recent versions of the Intel compiler and gcc) can automatically parallelize simple loops, but anything beyond that is too complex. On the other hand, there are task libraries that use thread pools, and will automatically scale the number of threads to the available processors, and divide the work between them. Of course, using such a library will require rewriting your code to do so.
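As an illustration of the kind of rewrite such a task library requires, here is a hedged sketch in Java: the loop body is handed to the runtime's common thread pool, which is sized to the available processors by default. The per-element function is a placeholder, and the example only works because each iteration is independent.

    import java.util.stream.IntStream;

    public class ParallelLoop {
        public static void main(String[] args) {
            double[] input = new double[1_000_000];
            double[] output = new double[input.length];

            // Sequential version:
            // for (int i = 0; i < input.length; i++) output[i] = expensive(input[i]);

            // Rewritten to use a thread-pool-backed parallel loop; each index
            // is written by exactly one task, so there is no data race.
            IntStream.range(0, input.length)
                     .parallel()
                     .forEach(i -> output[i] = expensive(input[i]));
        }

        private static double expensive(double x) {
            // Placeholder for a CPU-heavy, independent per-element computation.
            return Math.sin(x) * Math.cos(x);
        }
    }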
Structuring your application to make best use of multithreading is not a simple matter, and requires careful thought about which parts of your application can best make use of it. This is not something that can be automated.
Consider multi-threading as an approach to make full use of the available resources; that is when it works best. Now consider an application that has multiple modules/areas which are multi-threadable: if all of them are made multi-threaded, the available resources might shrink substantially, which could at times be detrimental to the application itself. Thus, multi-threading has to be used very carefully.
As Chris mentioned, there are a lot of profilers available for a given combination of OS and language.
The first thing you need to do is profile your code in a single thread and see if the areas you think are good candidates for multithreading are actually a problem. It's easy to waste a lot of time multithreading working code only to end up with a buggy mess that's slower than the original implementation if you don't carefully consider the problem first.

MSXML XSL Transformation multithreaded performance contention

I have a multithreaded server C++ program that uses MSXML6 and continuously parses XML messages, then applies a prepared XSLT transform to produce text. I am running this on a server with 4 CPUs. Each thread is completely independent and uses its own transform object. There is no sharing of any COM objects among the threads.
This works well, but the problem is scalability. When running:
with 1 thread, I get about 26 parse+transformations per second per thread;
with 2 threads, about 20/s/thread;
with 3 threads, about 18/s/thread;
with 4 threads, about 15/s/thread.
With nothing shared between threads I expected near-linear scalability so it should be 4 times faster with 4 threads than with 1. Instead, it is only 2.3 times faster.
It looks like a classic contention problem. I've written test programs to eliminate the possibility of the contention being in my code. I am using the DOMDocument60 class instead of the FreeThreadedDOMDocument one in order to avoid unnecessary locking since the documents are never shared between threads. I looked hard for any evidence of cache-line false sharing and there isn't any, at least in my code.
Another clue, the context switch rate is > 15k/s for each thread.
I am guessing the culprit is the COM memory manager or the memory manager within MSXML. Maybe it has a global lock that has to be acquired and released for every memory allocation/deallocation. I just can't believe that, in this day and age, the memory manager is not written in a way that scales nicely in multithreaded, multi-CPU scenarios.
Does anyone have any idea what is causing this contention or how to eliminate it?
It is fairly common for heap-based memory managers (your basic malloc/free) to use a single mutex; there are fairly good reasons for this: a heap memory area is a single coherent data structure.
There are alternate memory management strategies (e.g. hierarchical allocators) that do not have this limitation. You should investigate customizing the allocator used by MSXML.
Alternatively, you should investigate moving away from a multi-threaded architecture to a multi-process architecture, with separate processes for each MSXML worker. Since your MSXML workers take string data as input and output, you do not have a serialization problem.
In summary: use a multiprocess architecture, it's a better fit to your problem, and it will scale better.
MSXML uses BSTRs, which use a global lock in their heap management. This caused us a ton of trouble in a massively multiuser app a few years ago.
We removed our use of XML in our app. You may not be able to do this, so you might be better off using an alternative XML parser.
Thanks for the answers. I ended up implementing a mix of the two suggestions.
I made a COM+ ServicedComponent in C#, hosted it as a separate server process under COM+, and used XslCompiledTransform to run the transformation. The C++ server connects to this external process using COM, sends it the XML, and gets back the transformed string. This doubled the performance.
