Java 7 forking and joining - fork-join

I have a main thread from which I want to spawn 2 threads to parse two different XML files. I want to know whether Java 7 fork/join should be used in this scenario, or whether the traditional approach we used in JDK 1.4 is enough for this case.

The Fork/Join Framework is great if you have a potential tree of tasks and the size of that tree is unknown. Merge sort is a good example here. With only two files to parse, however, you won't be able to utilize the key features of the FJF:
Work stealing - dynamic balancing of the task queues of worker threads
Easy scheduling of new tasks spawned by existing ones
You may, of course, implement it with the FJF to play with the nice new classes, and it will do the trick. But you're unlikely to get any performance or maintainability benefit out of it, so my recommendation would be to follow the traditional approach here.
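For reference, here is a minimal sketch of that traditional approach: a two-thread ExecutorService parsing both files with the JDK's built-in DOM parser. The file names and class name below are placeholders.

    import java.io.File;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class TwoFileParse {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(2);
            try {
                // one task per file
                List<Future<Document>> results = pool.invokeAll(Arrays.asList(
                        parseTask("first.xml"), parseTask("second.xml")));
                for (Future<Document> f : results) {
                    Document doc = f.get(); // blocks until that parse finishes
                    System.out.println(doc.getDocumentElement().getNodeName());
                }
            } finally {
                pool.shutdown();
            }
        }

        // DocumentBuilder is not thread-safe, so each task builds its own
        private static Callable<Document> parseTask(final String path) {
            return new Callable<Document>() {
                public Document call() throws Exception {
                    return DocumentBuilderFactory.newInstance()
                            .newDocumentBuilder().parse(new File(path));
                }
            };
        }
    }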

Related

Spawning 10,000 threads doesn't seem like the right approach, alternative ideas?

I am trying to simulate a decentralized system, but I am having trouble simulating it with real-life parameters.
Real-world:
Each module has its own computer and they communicate over a network
There can be hundreds of thousands of modules
They will communicate with each other to perform a certain action
Simulation:
Each module is considered its own thread since they are working async
Can't really spawn more than 1,000 threads
The thread to the module ratio is 1 to 1
Is spawning a thread per module the right approach? In theory, this seems like the right approach but in practice, it hits limitations at around 1,000 threads.
Your context is a perfect match for the actor model:
https://en.wikipedia.org/wiki/Actor_model
Explaining it fully in an answer is impossible; start from the wiki link and look for an implementation in the language you are using. It does what you need: you can simulate millions of "isolated states" and manage the concurrency of their mutations using very few resources (you should be able to run 1K actors with very few threads, maybe even just two).
Also, many languages nowadays offer (in their own flavour) a form of lightweight threads that can be used to reduce the number of real threads used (goroutines, Kotlin coroutines, Java fibers, etc.).
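To make the idea concrete, here is a minimal, hand-rolled sketch of that style in Java (illustrative only; a real actor library handles this and much more). Each module owns a mailbox and only occupies a pool thread while it actually has messages to process, so a small fixed pool can drive a very large number of modules.

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicBoolean;

    // A toy "actor": private state plus a mailbox, processed one message at a time.
    class Module {
        private final ConcurrentLinkedQueue<String> mailbox = new ConcurrentLinkedQueue<>();
        private final AtomicBoolean scheduled = new AtomicBoolean(false);
        private final ExecutorService pool;
        private int messagesSeen; // only touched by the drain task, never concurrently

        Module(ExecutorService pool) { this.pool = pool; }

        void tell(String msg) {
            mailbox.add(msg);
            // schedule at most one drain task at a time so state is never touched concurrently
            if (scheduled.compareAndSet(false, true)) {
                pool.execute(this::drain);
            }
        }

        private void drain() {
            String msg;
            while ((msg = mailbox.poll()) != null) {
                messagesSeen++; // the module's "behaviour" goes here
            }
            scheduled.set(false);
            // a message may have arrived after the last poll but before the flag was cleared
            if (!mailbox.isEmpty() && scheduled.compareAndSet(false, true)) {
                pool.execute(this::drain);
            }
        }
    }

    public class Simulation {
        public static void main(String[] args) {
            // a handful of real threads can drive a very large number of modules
            ExecutorService pool = Executors.newFixedThreadPool(4);
            Module[] modules = new Module[100_000];
            for (int i = 0; i < modules.length; i++) {
                modules[i] = new Module(pool);
            }
            modules[0].tell("hello");
            pool.shutdown();
        }
    }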

Running multiple ANTLR4 lexer/parser instances in parallel

I am using an ANTLRv4-generated parser to process a large number of files on a machine with multiple cores. To gain some extra speed, I would like to process the files in parallel.
To check if parser performance is CPU bound, I split the files into groups and parsed them using independent processes each running the same parser in a dedicated JVM instance. This increased performance drastically.
This encouraged me to try the same using multiple threads instead of processes, however, without success. I created two worker threads, each with its own instance of parser, lexer and file-stream. The results returned are correct, however, using two threads actually takes slightly longer than using one.
To ensure that I am using threads correctly and that there is no problem with the JVM installation, I temporarily replaced the parsing code with code that calculates Fibonacci sequences; in that case, using multiple threads led to a performance increase.
Analyzing this behavior, I found that when using multiple parsing-threads, none of the CPUs reach high usage. It looks like the threads are fighting over some shared resource.
Taking a look at the ANTLR source code, I found the following comment in ParserATNSimulator.java:
"All instances of the same parser share the same decision DFAs through a static field. Each instance gets its own ATN simulator but they share the same decisionToDFA field. They also share a PredictionContextCache object that makes sure that all PredictionContext objects are shared among the DFA states. This makes a big size difference."
I am wondering whether synchronized access to these shared resources is causing the performance problems. If so, is there the possibility of creating unique instances of these resources instead? Or is there maybe even a much simpler solution to the problem?
Thanks in advance!
Fabian
The reference version of the ANTLR 4 runtime is designed to be safe when using multiple parser threads (so long as multiple parser instances are used). I maintain an alternate (unofficial) branch of ANTLR 4 which implements the core algorithms in a different way to improve performance in multithreaded scenarios.
This branch exposes a slightly different API in some areas, so it's not a drop-in replacement for the 4.0 release of ANTLR 4.
https://github.com/sharwell/antlr4/tree/optimized
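In line with that answer, the safe pattern on the reference runtime is one lexer/parser pair per task. A rough sketch is below; MyGrammarLexer, MyGrammarParser and startRule() stand in for whatever your grammar generates, and CharStreams assumes a reasonably recent 4.x runtime (older releases used ANTLRFileStream instead).

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.antlr.v4.runtime.CharStreams;
    import org.antlr.v4.runtime.CommonTokenStream;
    import org.antlr.v4.runtime.tree.ParseTree;

    public class ParallelParse {
        public static void main(String[] args) throws Exception {
            List<String> files = Arrays.asList("a.txt", "b.txt"); // placeholder input
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            List<Future<ParseTree>> results = new ArrayList<>();
            for (final String path : files) {
                results.add(pool.submit(new Callable<ParseTree>() {
                    public ParseTree call() throws Exception {
                        // fresh lexer + parser per task: never share them across threads
                        MyGrammarLexer lexer = new MyGrammarLexer(CharStreams.fromFileName(path));
                        MyGrammarParser parser = new MyGrammarParser(new CommonTokenStream(lexer));
                        return parser.startRule(); // your grammar's entry rule
                    }
                }));
            }
            for (Future<ParseTree> f : results) {
                ParseTree tree = f.get();
                // ... walk or visit the tree here ...
            }
            pool.shutdown();
        }
    }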

How is Node.js evented system different than the actor pattern of Akka?

I've worked with Node.js for a little while and consider myself pretty good with Java. But I just discovered Akka and was immediately interested in its actor pattern (from what I understand).
Now, assuming my JavaScript skills were on par with my Scala/Java skills, I want to focus on the practicality of either system. Especially in the terms of web services.
It was my understanding that Node is excellent at handling many concurrent operations. I imagine a good Node web service for an asset management system would excel at handling many users submitting changes at the same time (in a large, heavy traffic application).
But after reading about the actors in Akka, it seems it would excel at the same thing. And I like the idea of reducing work to bite-sized pieces. Plus, years ago I dabbled in Erlang and fell in love with the message passing system it uses.
I work on many applications that deal with complex business logic and I'm thinking it's time to jump heavier into one or the other. Especially upgrading legacy Struts and C# applications.
Anyway, avoiding holy wars, how are the two systems fundamentally different? It seems both are geared towards the same goal. With maybe Akka's "self-healing" architecture having an advantage.
EDIT
It looks like I am getting close votes. Please don't take this question as a "which is better, node or akka?". What I am looking for is the fundamental differences in event driven libraries like Node and actor based ones like Akka.
Without going into the details (about which I know too little in the case of Node.js), the main difference is that Node.js supports only concurrency without parallelism while Akka supports both. Both systems are completely event-driven and can scale to large work-loads, but the lack of parallelism makes it difficult in Node.js (i.e. parallelism is explicitly coded by starting multiple nodes and dispatching requests accordingly; it is therefore inflexible at runtime), while it is quite easy in Akka due to its tunable multi-threaded executors. Given small isolated units of work (actor invocations) Akka will automatically parallelize execution for you.
Another difference of importance is that Akka includes a system for handling failure in a structured way (by having each actor supervised by its parent, which is mandatory) whereas Node.js relies upon conventions for authors to pass error conditions from callback to callback. The underlying problem is that asynchronous systems cannot use the standard approach of exceptions employed by synchronous stack-based systems, because the “calling” code will have moved on to different tasks by the time the callback’s error occurs. Having fault handling built into the system makes it more likely that applications built on that system are robust.
The above is not meant to be exhaustive, I’m sure there are a lot more differences.
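As a small illustration of that last point, here is a minimal sketch using the classic (untyped) Akka Java API; the names are made up. Notice that nothing in the code mentions threads, yet independent actor invocations are spread across the dispatcher's thread pool.

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;

    public class AkkaDemo {
        // each Worker processes one message at a time; independent workers run in parallel
        static class Worker extends AbstractActor {
            @Override
            public Receive createReceive() {
                return receiveBuilder()
                        .match(String.class, job -> {
                            // an isolated unit of work; no shared mutable state
                            System.out.println(getSelf().path().name() + " handled " + job);
                        })
                        .build();
            }
        }

        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("demo");
            for (int i = 0; i < 100; i++) {
                ActorRef worker = system.actorOf(Props.create(Worker.class), "worker-" + i);
                worker.tell("job-" + i, ActorRef.noSender());
            }
            // the dispatcher (a tunable thread pool) parallelizes these invocations for us
        }
    }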
I haven't used Akka yet, but it seems to be Erlang-like, but in Java. In Erlang all processes are like actors in Akka: they have mailboxes, you can send messages between them, you have supervisors, etc.
Node.js uses cooperative concurrency. That means you have concurrency only when you allow it (for example, when you call an I/O operation or some asynchronous event). When you have a long operation (calculating something in a long loop), the whole system blocks.
Erlang uses preemptive task switching. When you have a long loop, the system can pause it to run another operation and continue after some time. For massive concurrency Node.js is good only if you do short operations. Both support millions of clients:
http://blog.caustik.com/2012/08/19/node-js-w1m-concurrent-connections/
http://blog.whatsapp.com/index.php/2012/01/1-million-is-so-2011/
In Java you need threads to do any concurrency; otherwise you can't pause execution inside a function the way Erlang does (actually, Erlang pauses between function calls, but this happens with all functions). You can pause execution between messages.
I'm not sure this is a fair comparison to draw. I read this more as "how does an event-based system compare with an actor model?". Node.js can support an actor model just as Scala does in Akka, or C# does in Orleans; in fact, check out nactor, somebody appears to already be trying it.
As for how an evented system and the actor model compare, I would let wiser people describe it. A few brief points, though, about the actor model:
Actor model is message based
Actor models tend to do well with distributed systems (clusters). Sure, event-based systems can be distributed, but I think the actor model has distribution built in with regard to distributing computations. A new request can be routed to a new actor in a different silo; I'm not sure how this would work in an event-based system.
The actor model supports failure handling in that, if cluster 1 appears to be down, the observer can generally find a different silo to do the work.
Also, check out drama. It is another nodejs actor model implementation.

What is so great about TPL

I have done this POC and verified that when you create 4 threads and run them on a quad-core machine, all cores get busy - so the CLR is already scheduling threads on different cores effectively. So why the Task class?
I agree Task simplifies the creation and use of threads, but apart from that, what? Is it just a wrapper around threads and thread pools? Or does it in some way help with scheduling threads on multicore machines?
I am specifically looking at what Task offers with respect to multicore that wasn't there in .NET 2.0 threads.
"I agree Task simplifies the creation and use of threads"
Isn't that enough? Isn't it fabulous that it provides higher-level building blocks so that us mere mortals can build lock-free multithreaded code which is safe because really smart people like Joe Duffy have done the work for us?
If the TPL really just consisted of a way of starting a new task, it wouldn't be much use - the work-stealing etc. is nice, but probably not crucial to most of us. It's the building blocks around tasks - and in particular around the idea of a "future" - which provide the value. Do you really want to write Parallel.ForEach yourself? Do you want to work out how to perform partitioning efficiently yourself? I know that if I tried doing that, it would take me a long time and I'd certainly do a worse job of it than the PFX team.
Many of the advances in development haven't been about making it possible to do something which was impossible before - they've been about raising the abstraction level so that a problem can be solved once and then that solution reused. Do you feel the same way about the CLR itself? You could do the same thing in assembly yourself, obviously... but by raising the abstraction level, the CLR and C# make us more productive.
Although you could do everything equivalently with the TPL or the thread pool, the TPL is preferred for its better abstraction and its scalability patterns. But it is up to the programmer: if you know exactly what you are doing, and depending on how your scheduling and synchronization requirements play out in your specific application, you could use the thread pool more effectively. There are some things you get for free with the TPL which you'd have to code up yourself when using the thread pool; the following are a few I can think of now:
work stealing
worker thread local pool
scheduling groups of actions like Parallel.For
The TPL lets you program in terms of tasks, not threads. Thinking of tasks solely as a better thread abstraction would be a mistake. Tasks allow you to specify the work you want to get executed, not how you want the work executed (threads). This allows you to express the potential parallelism of your application and have the TPL runtime--the scheduler and thread pool--handle how that work gets executed. This means that the TPL takes a lot of the burden off you of ensuring your application gets the best performance on a wide variety of hardware with different numbers of cores.
For example, the TPL makes it easy to implement several key design patterns that allow you to express the potential parallelism of your application:
http://msdn.microsoft.com/en-us/library/ff963553.aspx
Like Futures (mentioned by Jon) as well as pipelines and parallel loops.
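The same "describe the work, not the threads" idea exists on the JVM, which the first question in this collection is about. Purely as an illustrative analogue (this is Java, not the TPL): futures with continuations on a work-stealing pool play the role of Task, and a parallel stream plays the role of Parallel.For.

    import java.util.concurrent.CompletableFuture;
    import java.util.stream.IntStream;

    public class TaskStyleDemo {
        public static void main(String[] args) {
            // a "future": describe the work; the common work-stealing pool decides how to run it
            CompletableFuture<Integer> total =
                    CompletableFuture.supplyAsync(() -> expensiveSum(1_000_000))
                                     .thenApply(sum -> sum * 2); // continuation, roughly ContinueWith

            // a parallel loop over a range, the rough analogue of Parallel.For
            long evens = IntStream.range(0, 1_000_000).parallel()
                                  .filter(i -> i % 2 == 0)
                                  .count();

            System.out.println(total.join() + " / " + evens);
        }

        private static int expensiveSum(int n) {
            int sum = 0;
            for (int i = 0; i < n; i++) {
                sum += i;
            }
            return sum;
        }
    }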

MSXML XSL Transformation multithreaded performance contention

I have a multithreaded server C++ program that uses MSXML6 and continuously parses XML messages, then applies a prepared XSLT transform to produce text. I am running this on a server with 4 CPUs. Each thread is completely independent and uses its own transform object. There is no sharing of any COM objects among the threads.
This works well, but the problem is scalability. When running:
with one thread, I get about 26 parse+transformations per second per thread,
with 2 threads, about 20/s/thread,
with 3 threads, about 18/s/thread,
with 4 threads, about 15/s/thread.
With nothing shared between threads I expected near-linear scalability so it should be 4 times faster with 4 threads than with 1. Instead, it is only 2.3 times faster.
It looks like a classic contention problem. I've written test programs to eliminate the possibility of the contention being in my code. I am using the DOMDocument60 class instead of the FreeThreadedDOMDocument one in order to avoid unnecessary locking since the documents are never shared between threads. I looked hard for any evidence of cache-line false sharing and there isn't any, at least in my code.
Another clue, the context switch rate is > 15k/s for each thread.
I am guessing the culprit is the COM memory manager or the memory manager within MSXML. Maybe it has a global lock that has to be acquired and released for every memory alloc/deallocation. I just can't believe that in this day and age, the memory manager is not written in a way that scales nicely in multithreaded multi-cpu scenarios.
Does anyone have any idea what is causing this contention or how to eliminate it?
It is fairly common for heap-based memory managers (your basic malloc/free) to use a single mutex, and there are fairly good reasons for it: a heap memory area is a single coherent data structure.
There are alternative memory management strategies (e.g. hierarchical allocators) that do not have this limitation. You should investigate customizing the allocator used by MSXML.
Alternatively, you should investigate moving away from a multithreaded architecture to a multi-process architecture, with a separate process for each MSXML worker. Since your MSXML workers take string data as input and output, you do not have a serialization problem.
In summary: use a multi-process architecture; it's a better fit for your problem, and it will scale better.
MSXML uses BSTRs, whose heap management uses a global lock. It caused us a ton of trouble for a massively multiuser app a few years ago.
We removed our use of XML in our app. You may not be able to do that, so you might be better off using an alternative XML parser.
Thanks for the answers. I ended up implementing a mix of the two suggestions.
I made a COM+ ServicedComponent in C#, hosted it as a separate server process under COM+, and used XslCompiledTransform to run the transformation. The C++ server connects to this external process using COM, sends it the XML, and gets back the transformed string. This doubled the performance.

Resources