Threads, Coro, Anyevent confusion - multithreading

I am relatively new to Perl and even newer to threading in Perl. I have a Perl script that takes input from 3 different sources (2 LDAP queries and a file that isn't always there). Because some parts can take longer than others, I decided to use threads and queues. During development, testing individual components of the script worked very well, but after putting it all together the performance seems to degrade.
The basic structure is this:
2 threads:(Read file or Read AD entries) -> Queue1 -> 2 threads:(scrub data) -> Queue2 -> 3-4 threads(compare against existing local LDAP entries). Several threads report statistics back to the main script and once all threads are done an email is sent with all the stats and status of that run.
I am using dequeue_nb and I thought that would help but no luck.
The performance hit seems to be in the queues. While looking for tips to improve performance I've run into several articles saying Perl threads are no good and to use Coro, POE, AnyEvent, IO::Async, etc.
This doesn't seem like an "event" problem, so I didn't think AnyEvent or POE would be the way to go, but from what I'm seeing, Coro only seems to use one CPU at a time, so I'm not sure this would work either. I thought about using a combination of them but then my head started to hurt. Does anyone have any suggestions on how to either fix/troubleshoot my script, or on how to implement one of the other modules?

A problem with parallelism is synchronization. It is a performance killer, it is bad, it is to be avoided if possible.
OP's architecture
Let's look at your architecture:
+--------------+--------------+
|   Input 1    |   Input 2    |
+--------------+--------------+
|           QUEUE A           |
+--------------+--------------+
|   Scrub 1    |   Scrub 2    |
+--------------+--------------+
|           QUEUE B           |
+---------+---------+---------+
| Compare | Compare | Compare |
+---------+---------+---------+
Discussion
Queue A has to be synchronized across four threads; Queue B across 5-6. Only one thread can access the Queue at any time, so most of the time your threads will be waiting, not working!
Parallel Pipeline Architecture
A somewhat different architecture could look like this:
+-----------+ +-----------+
|  Input 1  | |  Input 2  |
+-----------+ +-----------+
| QUEUE 1A  | | QUEUE 2A  |
+-----------+ +-----------+
|  Scrub 1  | |  Scrub 2  |
+-----------+ +-----------+
| QUEUE 1B  | | QUEUE 2B  |
+-----+-----+ +-----+-----+
| Cmp | Cmp | | Cmp | Cmp |
+-----+-----+ +-----+-----+
Discussion
Here, the A Queues are each shared by only two threads (so less waiting), and the B Queues by only three. This architecture should perform faster for similar input size/complexity. The drawback: if Input 2 were considerably shorter, the whole of Pipeline 2 would have finished before Pipeline 1 is even half done, and its threads would then sit idle. It is, however, far better than using a single process for each pipeline.
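As a rough illustration of this second layout, here is a minimal sketch in Python (not Perl; it is only meant to show the queue topology, and CPython's global interpreter lock means it will not show real CPU parallelism). All names (`reader`, `scrubber`, `comparer`, `run_all`) and the trivial "scrub" and "compare" steps are assumptions made up for the example. Each pipeline owns its own two queues, and workers block on `get()` instead of polling.

```python
import queue
import threading

SENTINEL = None  # marks the end of a stream


def reader(items, out_q):
    # Stage 1: push every input record into this pipeline's own queue A.
    for item in items:
        out_q.put(item)
    out_q.put(SENTINEL)


def scrubber(in_q, out_q, n_consumers):
    # Stage 2: clean each record, then hand it to this pipeline's queue B.
    while True:
        item = in_q.get()          # blocking get, no busy polling
        if item is SENTINEL:
            break
        out_q.put(item.strip().lower())
    for _ in range(n_consumers):   # one end marker per comparer
        out_q.put(SENTINEL)


def comparer(in_q, known, results, lock):
    # Stage 3: compare against the existing entries and report.
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        with lock:                 # the result list is the only shared state
            results.append((item, item in known))


def run_all(inputs, known, n_comparers=2):
    results, lock, threads = [], threading.Lock(), []
    for items in inputs:           # one independent pipeline per input
        q_a, q_b = queue.Queue(), queue.Queue()
        threads.append(threading.Thread(target=reader, args=(items, q_a)))
        threads.append(threading.Thread(target=scrubber,
                                        args=(q_a, q_b, n_comparers)))
        for _ in range(n_comparers):
            threads.append(threading.Thread(target=comparer,
                                            args=(q_b, known, results, lock)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_all([file_lines, ad_entries], known_entries)` would start two pipelines of four threads each; the order of the returned results is not deterministic.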
The Lawn Sprinkler Architecture
Concept
An even better architecture would try to distribute the output of a process across multiple Queues. (The reverse, having threads fetch their input from multiple queues, is bad when a queue is empty.)
Each Queue write should go to a different queue:
   +-----------+ +-----------+
   |  Input 1  | |  Input 2  |
   +-----------+ +-----------+
         |     \ /     |
   +-----------+ +-----------+
   | QUEUE 1A  | | QUEUE 2A  |
   +-----------+ +-----------+
   |  Scrub 1  | |  Scrub 2  |
   +-----------+ +-----------+
     /    |   \  \ /  /   |    \
+-------+-------+-------+-------+
| Q. 1B | Q. 2B | Q. 3B | Q. 4B |
+-------+-------+-------+-------+
|  Cmp  |  Cmp  |  Cmp  |  Cmp  |
+-------+-------+-------+-------+
This makes sure each thread has the same workload, but it cannot make sure that all threads finish at the same time.
Discussion
All Queues are shared among 3 Threads. The problem is that two Threads will block each other when writing to the same queue. If the time between Queue write accesses is significantly larger than the write duration, this should be no problem; otherwise the second architecture can be mixed in.
So whether this architecture makes sense depends on your exact requirements.
It is slower for evenly sized inputs, but performs better on irregular input.
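The "each write goes to a different queue" idea can be sketched as a simple round-robin writer. This is an illustrative Python sketch, not Perl; `distribute` and `drain` are hypothetical helper names, and the demo runs in a single thread just to show where the items end up.

```python
import itertools
import queue


def distribute(items, out_queues):
    """Write each item to the next output queue in turn (round robin)."""
    targets = itertools.cycle(out_queues)
    for item in items:
        next(targets).put(item)


def drain(q):
    """Collect everything currently in a queue (for inspection only)."""
    out = []
    while True:
        try:
            out.append(q.get_nowait())
        except queue.Empty:
            return out
```

With four output queues and ten items, the queues receive three, three, two and two items, so every comparer gets about the same number of records regardless of which input they came from.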
Appendices
On implementing:
What framework is used is secondary to the architecture. If you only pass around text strings, I strongly advise using pipes. If you have to pass Perl data types or objects, you probably have to accept the additional overhead of using a real Queue: when adding an unshared variable to a queue, a deep copy has to be made (see @Leon Timmermans' answer) in addition to all the synchronization overhead.
On scalability:
Architecture 1 and 3 are not fixed in the number of threads. I strongly suggest using this flexibility to benchmark different compositions. A rule of thumb is that you should use n to 2n threads where n is the number of processors (or hardware threads). This can be seen as a maximal sensible number for the threads of one stage. Above that, you only get a memory penalty and no speedup. A performance saturation point may be reached earlier, when a stage can process the input faster than it is supplied.

What kind of data are you putting in the queues? AFAIK simple data is cheaper than complex structures, since the data needs to be cloned and copied at least twice. I've been planning to write a faster queue implementation (most of the work is already done actually), but haven't published that yet.

Related

What's the Mark-Compact algorithm used by HotSpot?

When reading the Mark-Compact chapter of The Garbage Collection Handbook, a sequence of alternatives was presented, but most of them looked old / theoretical (for instance, the 2-finger compaction and the Lisp2 3-pass approach requiring an extra header word per object).
Is anyone aware of which algorithm HotSpot uses when running Mark-Compact (in its old generation, I assume)?
Thanks
Big disclaimer: I am not a GC expert/writer; everything written below is subject to change and some of it might be way too simplistic. Please take this with a grain of salt.
I will only speak about Shenandoah, as I think I understand it; note that it is not a generational GC.
There are actually two phases here: Mark and Compact. I would strongly emphasize that both are concurrent and happen while your application is running (with some very short STW events).
And now to the details. I have explained a bit of this here, but because that answer relates to a somewhat different question, I'll explain more here. I assume that traversing the graph of live objects is no news to you; after all, you are reading a book about GC. As that answer explains, when the application is fully stopped (also called brought to safe-points), identifying live objects is easy. No one is changing anything under your feet, the floor is rigid and you control everything. Parallel collectors do this.
The really painful way is to do things concurrently. Shenandoah employs an algorithm called Snapshot at the beginning (that book explains it AFAIK); I will call it SATB for short. Basically this algorithm is implemented like this: "I will start to scan the graph of objects concurrently (from GC roots); if anything changes while I scan, I will not alter the heap, but will record these changes and deal with them later".
The very first part that you need to question is : while I scan. How is that achieved? Well, before doing the concurrent mark, there is a STW event called Initial Mark. One of the things that gets done in that phase is to set a flag that concurrent marking has started. Later, while executing code, that flag is checked (Shenandoah thus employs changes in the interpreter). In pseudo-code:
if(!concurrentMarkingActive) {
    // do whatever you were doing and alter the heap
} else {
    // shenandoah magic
}
In machine code that might look like this:
test %r11, %r11 (test concurrentMarkingActive flag)
jne // concurrent marking is currently active
Now GC knows when concurrent marking is happening.
But how is concurrent marking even implemented? How can you scan the heap while the heap itself is mutated (not stable)? The floor under your feet keeps gaining and losing holes.
That is the "shenandoah magic". Changes to the heap are "intercepted" and not persisted directly. So if GC performs a concurrent mark at this point in time, and application code tries to mutate the heap, those changes are recorded in each thread's SATB queue (snapshot at the beginning). When concurrent mark is over, those queues are drained (via a STW event called Final Mark) and the changes that were drained are analyzed again (remember, under a STW event now).
When the Final Mark phase is over, GC knows what is alive and thus, implicitly, what is garbage.
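To make the "record changes and deal with them later" idea concrete, here is a toy, single-threaded Python model of snapshot-at-the-beginning marking. It is only a sketch under heavy assumptions: `Obj`, `Heap`, `write_ref`, `mark_step` and `final_mark` are invented names, there is one SATB queue instead of one per thread, and the "concurrency" is simulated by interleaving calls by hand. It is not how HotSpot implements it.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.fields = {}          # field name -> referenced Obj or None


class Heap:
    def __init__(self):
        self.marking_active = False
        self.satb_queue = []      # per-thread in a real VM; one queue here
        self.marked = set()
        self.worklist = []

    def write_ref(self, obj, field, new_value):
        # Pre-write barrier: while marking, remember the value being overwritten.
        if self.marking_active:
            old = obj.fields.get(field)
            if old is not None:
                self.satb_queue.append(old)
        obj.fields[field] = new_value

    def initial_mark(self, roots):
        # Short pause: raise the flag and seed the worklist with the roots.
        self.marking_active = True
        self.worklist = list(roots)

    def mark_step(self):
        # Scan one unmarked object; return False once nothing is left to scan.
        while self.worklist:
            obj = self.worklist.pop()
            if obj.name in self.marked:
                continue
            self.marked.add(obj.name)
            self.worklist.extend(v for v in obj.fields.values() if v is not None)
            return True
        return False

    def final_mark(self):
        # Short pause: drain the recorded changes and finish marking.
        self.worklist.extend(self.satb_queue)
        self.satb_queue.clear()
        while self.mark_step():
            pass
        self.marking_active = False
        return self.marked


def demo():
    a, b, c = Obj("A"), Obj("B"), Obj("C")
    heap = Heap()
    heap.write_ref(a, "next", b)      # A -> B -> C before marking starts
    heap.write_ref(b, "next", c)
    heap.initial_mark([a])
    heap.mark_step()                  # marker scans A
    heap.write_ref(a, "other", c)     # mutator stores C into the scanned A
    heap.write_ref(b, "next", None)   # and removes the old path; C is logged
    while heap.mark_step():
        pass                          # marker scans B and never sees C
    before_final = set(heap.marked)
    return before_final, heap.final_mark()
```

In `demo()`, the concurrent scan alone finds only A and B; C is picked up when the queue is drained in `final_mark`, which is the situation the barrier exists for.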
Compact phase is next. Shenandoah is now supposed to move live objects to different regions (in a compact fashion) and mark the current region as one where we can allocate again. Of course, in a simple STW phase, this would be easy : move the object, update references pointing to it. Done. When you have to do it concurrently...
You can't take the object and simply move it to a different region and then update your references one by one. Think about it, let's suppose this is the first state we have:
refA, refB
    |
---------
| i = 0 |
| j = 0 |
---------
There are two references to this instance: refA and refB. We create a copy of this object:
refA, refB
    |
---------     ---------
| i = 0 |     | i = 0 |
| j = 0 |     | j = 0 |
---------     ---------
We created a copy, but have not updated any references yet. We now move a single reference to point to the copy:
   refA          refB
    |             |
---------     ---------
| i = 0 |     | i = 0 |
| j = 0 |     | j = 0 |
---------     ---------
And now the interesting part: ThreadA does refA.i = 5, while ThreadB does refB.j = 6 so your state becomes:
   refA          refB
    |             |
---------     ---------
| i = 5 |     | i = 0 |
| j = 0 |     | j = 6 |
---------     ---------
How do you merge these objects now? I'll be honest - I have no idea if that would even be possible and neither is this a route that Shenandoah took.
Instead, Shenandoah's solution does a very interesting thing, IMHO. An extra pointer, called a forwarding pointer, is added to each instance:
 refA, refB
     |
fwdPointer1
     |
 ---------
 | i = 0 |
 | j = 0 |
 ---------
refA and refB point to fwdPointer1, while fwdPointer1 points to the real Object. Let's create the copy now:
 refA, refB
     |
fwdPointer1      fwdPointer2
     |                |
 ---------        ---------
 | i = 0 |        | i = 0 |
 | j = 0 |        | j = 0 |
 ---------        ---------
And now, we want to switch all references (refA and refB) to point to the copy. If you look closely, this requires only a single pointer change - fwdPointer1. Make fwdPointer1 point to fwdPointer2 and you are done. This means one single change as opposed to two (in this set-up) of refA and refB. The bigger win here is that you don't need to scan the heap and find out references that point to your instance.
Is there a way to atomically update a reference? Of course: AtomicReference (at least in Java). The idea here is almost the same: we atomically change fwdPointer1 via a CAS (compare and swap), like this:
 refA, refB
     |
fwdPointer1 ---- fwdPointer2
                      |
 ---------        ---------
 | i = 0 |        | i = 0 |
 | j = 0 |        | j = 0 |
 ---------        ---------
So, refA and refB point to fwdPointer1, which now points to the copy we have created. Via a single CAS operation, we have switched concurrently all references to the newly created copy.
Then, GC can simply (concurrently) update all references refA and refB to point to the fwdPointer2. In the end having this:
                 refA, refB
                      |
fwdPointer1 ---- fwdPointer2
                      |
 ---------        ---------
 | i = 0 |        | i = 0 |
 | j = 0 |        | j = 0 |
 ---------        ---------
So, the Object on the left is now garbage: there are no references pointing to it.
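The forwarding-pointer trick can be mimicked in a few lines of Python. This is only a sketch of the idea as described above: `Obj`, `cas_fwd`, `resolve` and `evacuate` are invented names, Python has no CAS instruction so a lock stands in for it, and the race between copying the fields and installing the copy (which the real collector has to handle) is ignored.

```python
import threading

_cas_lock = threading.Lock()   # stand-in for a hardware compare-and-swap


class Obj:
    def __init__(self, i=0, j=0):
        self.i = i
        self.j = j
        self.fwd = self        # forwarding pointer: initially the object itself


def cas_fwd(obj, expected, new):
    """Set obj.fwd to new only if it still equals expected; report success."""
    with _cas_lock:
        if obj.fwd is expected:
            obj.fwd = new
            return True
        return False


def resolve(ref):
    """Every access goes through the forwarding pointer."""
    return ref.fwd


def evacuate(obj):
    """Copy obj and install the copy with one CAS; losers adopt the winner's copy."""
    copy = Obj(obj.i, obj.j)
    if cas_fwd(obj, obj, copy):
        return copy
    return obj.fwd
```

For example, after `original = Obj()`, `ref_a = ref_b = original` and `copy = evacuate(original)`, the writes `resolve(ref_a).i = 5` and `resolve(ref_b).j = 6` both land in `copy`, even though neither reference has been updated yet; a second `evacuate(original)` loses the CAS and returns the same `copy`.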
But, we need to understand the drawbacks, there is no free lunch.
First, the obvious one: Shenandoah adds an extra machine word to each instance in the heap (read further, as this is no longer true; but it makes understanding easier).
Each of these copies will generate an extra object in the new region, thus at some point there will be at least two copies of the same object (extra space that Shenandoah requires in order to function).
When ThreadA does refA.i = 5 (from the previous example), how does it know whether it should create a copy, write to that copy and CAS the forwarding pointer, or simply write to the object? Remember that this happens concurrently. The solution is the same as with the concurrentMarkingActive flag: there is a flag isEvacuationToADifferentRegionActive (not the actual name). If that flag is true => Shenandoah magic; otherwise simply do the write as is.
If you really understood this last point, your natural question should be :
"WAIT A SECOND! Does this mean that Shenandoah does an if/else against isEvacuationToADifferentRegionActive for EACH AND SINGLE write to an instance - be that primitive or reference? Also does that mean that EACH read must be accessed via the forwarding pointer?"
The answer used to be YES, but things have changed via this issue (though I make it sound a lot worse than it really is). Now they use load barriers for the entire Object; more details here. Instead of having a barrier on each write (that if/else against the flag) and a dereference via the forwarding pointer for each read, they moved to a load barrier. Basically, do that if/else only when you load the object. Since writing to it implies reading it first, they thus preserve the "to-space invariant". Apparently this is simpler, better and easier to optimize. Hooray!
Remember that forwarding pointer? Well, it does not exist anymore. I don't understand the details in their entirety (yet), but it has something to do with the possibility of using the mark word of the from-space object, which is no longer accessed since the addition of load barriers. A lot more details here. Once I understand how that really works internally, I will update the post.
G1 is not very different from Shenandoah, but the devil is in the details. For example, the Compact phase in G1 is always a STW event. G1 is always generational, whether you want that or not (Shenandoah can be sort of like that; there is a setting to control this), etc.

Questasim - Is it possible to log and reload signals on new design?

I am running a (UVM) test with a lot of components. It is a top-level (TL) test; however, I am debugging an internal module and I am only interested in the signals of the interfaces connected to that module. Since it is a TL test, it takes a long time to get to the point in time I am interested in. Those signals are produced by other modules, but I am not interested in those right now.
At the moment I am using Questa Sim, so I was wondering if there is a way of storing the events from those signals so that I can rerun only those. My intention is to change the module, recompile, and directly drive the recorded inputs into the new version without having to run the whole test and wait that long.
Inside a big chip company I used to work at, they call it "Save and Restore". Not sure what your EDA vendor calls it. You should be able to take a "Value Change Dump" or "VCD" file of the signal snapshot at the end of the bootup simulation and convert that to a bunch of 0-time puts on the wires. You may have to force the wires for a few clocks and then release the forces.
In regards to your comment about interacting with the UVM testing infrastructure, I'm not exactly sure about the behavior of multiple puts or forces on one node. I would guess that the last one wins. However, forces are very, very node-specific. The reset force will win and be latched into the design if it is downstream. If your design looks like this, then the force <path> 0 from the reset code will win, because it is downstream:
              +--------------------------------------------+
              | TopDesign.sv   +------------------------+  |
              |                | SubBlock.sv            |  |
              |                |                        |  |
            1 |       1        | 0  +--------------+ 0  |  |
         ----->---------------->----> register Foo >--  |  |
           ^  |                | ^  |              |    |  |
  UVM Driver  |                | |  +--------------+    |  |
              |                +-|----------------------+  |
              +------------------|-------------------------+
                                 |0
                            Reset force
If your UVM infrastructure forces on an interface and then your reset initialization force is on a downstream node, which will synthesize to the same wire, then the downstream node force will win, because this will actually be flopped into the logic.
You still have to take care of initializing the UVM checkers or scoreboards into a post-reset state.

JVM parallel object creation performance

It is said that Java allows multiple threads to run in parallel. It is also said that object creation is cheap, so I should always prefer creating new objects to reusing them. But, to my knowledge, objects are created in the global scope (to become subject to GC). So I wonder: is parallelism stopped whenever any of the threads creates an object?
AFAIK, unmanaged languages create objects on the thread stack so that threads keep running independently. They are all removed once you exit the subprogram scope. That is, you need not add the objects to a global list and stop the machine to GC them later. You could do the same with Int/String-like immutable objects in Java, because they cannot refer to other objects and so cannot create circular dependencies that need GC to clean up. But, AFAIK, nothing is cleaned up at procedure exit in Java.
Allocation of small objects is quite cheap most of the time because of the TLAB (Thread Local Allocation Buffer). Every thread has a special area in Eden, called a TLAB, reserved for its thread-local allocations.
So, you need synchronization only when allocating a new TLAB after the previous one has filled up. That synchronization is a CAS operation, so it is quite fast.
                 Eden                 Survivor 1/2
-------------------------------------------------
| T |     | T |                  ||      |      |
| L |     | L |                  ||      |      |
| A |     | A |                  ||      |      |
| B |     | B |                  ||      |      |
|   |     |   |                  ||      |      |
| 1 |     | 2 |                  ||      |      |
-------------------------------------------------
  ^         ^
  |         |
reserved for|thread-1 allocations
            |
            |
      reserved for thread-2 allocations
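The TLAB scheme boils down to "bump a private pointer, and only touch the shared Eden pointer when the private buffer runs out". Here is a small Python model of that; `Eden`, `ThreadAllocator` and the sizes are made up for illustration, addresses are plain integers, and a lock stands in for the CAS on the shared Eden pointer. A real JVM also handles objects too large for a TLAB, which this sketch simply rejects.

```python
import threading


class Eden:
    def __init__(self, size, tlab_size):
        self.size = size
        self.tlab_size = tlab_size
        self.top = 0                      # shared allocation pointer
        self.lock = threading.Lock()      # stand-in for a CAS on `top`
        self.refills = 0                  # how often threads had to synchronize

    def new_tlab(self):
        # The only synchronized operation: carve out a fresh buffer.
        with self.lock:
            if self.top + self.tlab_size > self.size:
                raise MemoryError("Eden is full; a real JVM would run a young GC")
            start = self.top
            self.top += self.tlab_size
            self.refills += 1
            return start, start + self.tlab_size


class ThreadAllocator:
    """Per-thread state: a private [pos, end) range inside Eden."""

    def __init__(self, eden):
        self.eden = eden
        self.pos, self.end = eden.new_tlab()

    def allocate(self, nbytes):
        if nbytes > self.eden.tlab_size:
            raise ValueError("object too large for a TLAB in this sketch")
        if self.pos + nbytes > self.end:  # TLAB exhausted: fetch a new one
            self.pos, self.end = self.eden.new_tlab()
        addr = self.pos                   # fast path: unsynchronized pointer bump
        self.pos += nbytes
        return addr
```

With a 64-byte TLAB, four 16-byte allocations by one thread need no synchronization after the initial buffer has been handed out; only the fifth goes back to the shared Eden pointer.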
Moreover, some optimizations can help you avoid allocations in compiled code: Escape Analysis and Scalar Replacement. In some scenarios the compiler can eliminate allocations by placing all the fields of an object on the stack.

Data segment during execution of program

Doubt:
If we execute a program, the following is the type of memory allocated to that program.
 __________________
|                  |
|      stack       |
|                  |
 ------------------
|                  |
|   <Unallocated   |
|      space>      |
 ------------------
|                  |
|                  |
|       Heap       |
|                  |
|                  |
 __________________
|                  |
|       data       |
 __________________
|       text       |
 __________________
Here the data segment plays a vital role. All initialized and uninitialized data are present in the data segment. But I do not know the order in which data is stored within the data segment, for example initialized data, uninitialized data, read-only data and read-write data. I think these are the four types present in the data segment.
So, in which order will the data be placed in the data segment? For example, is initialized data placed first, at the lowest addresses, followed by uninitialized data at higher addresses, and so on?
Thanks in Advance.
The order of global variables in the data segment cannot be determined in advance - it is up to your linker and compiler. Normally linkers preserve the order in which variables appear in the linked object files, but this is not a hard requirement (for example, the linker could put double variables first and char last to conserve the required alignment bytes).
Uninitialized global data are generally present in the .bss segment, which is placed after the .data segment (in your picture, "above" it, since higher portions of your picture = larger addresses). The .bss segment is all zeros and only its size is stored in the executable. This way, we don't need to store long strings of zeros in the binary image.
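The "only its size is stored" point can be modeled in a couple of lines. In an ELF program header this corresponds to the file size (p_filesz) being smaller than the memory size (p_memsz), with the loader zero-filling the difference. The Python function below is just a toy model of that step; `load_segment` is a made-up name and no real executable is parsed.

```python
def load_segment(file_bytes, mem_size):
    """Build the in-memory image: the stored .data bytes, then zero-filled .bss."""
    if mem_size < len(file_bytes):
        raise ValueError("memory size cannot be smaller than the stored bytes")
    zero_fill = bytes(mem_size - len(file_bytes))   # bytes(n) is n zero bytes
    return bytes(file_bytes) + zero_fill
```

For instance, `load_segment(b"\x01\x02\x03\x04", 12)` yields a 12-byte image although only 4 bytes had to be stored in the file.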

What does @plt mean here?

0x00000000004004b6 <main+30>: callq 0x400398 <printf@plt>
Anyone knows?
UPDATE
Why two disas printf give me different result?
(gdb) disas printf
Dump of assembler code for function printf@plt:
0x0000000000400398 <printf@plt+0>: jmpq *0x2004c2(%rip) # 0x600860 <_GLOBAL_OFFSET_TABLE_+24>
0x000000000040039e <printf@plt+6>: pushq $0x0
0x00000000004003a3 <printf@plt+11>: jmpq 0x400388
(gdb) disas printf
Dump of assembler code for function printf:
0x00000037aa44d360 <printf+0>: sub $0xd8,%rsp
0x00000037aa44d367 <printf+7>: mov %rdx,0x30(%rsp)
0x00000037aa44d36c <printf+12>: movzbl %al,%edx
0x00000037aa44d36f <printf+15>: mov %rsi,0x28(%rsp)
0x00000037aa44d374 <printf+20>: lea 0x0(,%rdx,4),%rax
0x00000037aa44d37c <printf+28>: lea 0x3f(%rip),%rdx # 0x37aa44d3c2 <printf+98>
It's a way to get code fix-ups (adjusting addresses based on where code sits in virtual memory, which may be different across different processes) without having to maintain a separate copy of the code for each process. The PLT, or procedure linkage table, is one of the structures which makes dynamic loading and linking easier to use (another is the GOT, or global offset table).
Refer to the following diagram, which shows both your calling code and the library code (that you call) mapped to different virtual addresses in two different processes, A and B. There is only one copy of each piece of code in real memory, with the different virtual addresses within each process mapping to that real address:
Process A
    Addresses (virtual):
      0x1234                        0x8888
      +-------------+ +---------+   +---------+
      |             | | Private |   |         |
      |             | | PLT/GOT |   |         |
      |   Shared    | +---------+   | Shared  |
=====   application ===============   library   =====
      |    code     | +---------+   |  code   |
      |             | | Private |   |         |
      |             | | PLT/GOT |   |         |
      +-------------+ +---------+   +---------+
      0x2020                        0x6666
Process B
When the shared library is brought in to the address space, entries are constructed in the process-specific (private) PLT and/or GOT which will, on first use, perform some fix-up to make things faster. Subsequent usage will then bypass the fix-up as it will no longer be needed.
The process goes something like this.
printf@plt is actually a small stub which (eventually) calls the real printf function, modifying things on the way to make subsequent calls faster.
The real printf function is mapped into an arbitrary location in a given process (virtual address space), as is the code that is trying to call it.
So, in order to allow proper code sharing of calling code (left side above) and called code (right side), you cannot apply any fix-ups to the calling code directly since that will "damage" how it works in the other processes (that wouldn't matter if it mapped to the same location in every process but that's a bit of a restriction, especially if something else had already been mapped there).
So the PLT is a smaller process-specific area at a reliably-calculated-at-runtime address that isn't shared between processes, so any given process is free to change it however it wants to, without adverse effects on other processes.
Let's follow the process through in a bit more detail. The diagram above doesn't show the address of the PLT/GOT since it can be found using a location relative to the current program counter. This is evidenced by your PC-relative lookup:
<printf@plt+0>: jmpq *0x2004c2(%rip) ; 0x600860 <_GOT_+24>
By using position independent code in the called library, along with the PLT/GOT, the first call to the function printf@plt (so in the PLT) is a multi-stage operation, in which the following actions take place:
It calls the GOT version (via a pointer) which initially points back to some set-up code in the PLT.
That set-up code loads the relevant shared library if not yet done, then modifies the GOT pointer so that subsequent calls go directly to the real printf (at the process-specific virtual address) rather than the PLT set-up code.
It then calls the loaded printf code at that address.
On subsequent calls, because the GOT pointer has been modified, the multi-stage approach is simplified:
It calls the GOT version (via pointer), which now points to the real printf.
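The first-call and later-call sequences above can be imitated with a table of function pointers that patches itself. The Python model below is only an analogy: `Process`, `printf_plt`, `LIBRARY` and `real_printf` are made-up names, the "GOT" is a dictionary, and nothing here touches a real dynamic linker.

```python
def real_printf(text):
    # Stands in for the shared library's printf.
    return f"printed: {text}"


LIBRARY = {"printf": real_printf}   # hypothetical symbol table of the library


class Process:
    """Each process has its own private GOT; the code being called is shared."""

    def __init__(self):
        self.resolver_calls = 0
        # Initially the GOT slot points back at the set-up (resolver) code.
        self.got = {"printf": self._resolve_printf}

    def _resolve_printf(self, *args):
        # First call only: look up the symbol and patch the GOT slot,
        # then continue into the real function.
        self.resolver_calls += 1
        target = LIBRARY["printf"]
        self.got["printf"] = target
        return target(*args)

    def printf_plt(self, *args):
        # The PLT stub: always an indirect jump through the GOT slot.
        return self.got["printf"](*args)
```

The first `printf_plt` call in a process goes through the resolver once; every later call jumps straight to `real_printf`, and a second `Process` starts with its own unresolved slot.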
A good article can be found here, detailing how glibc is loaded at run time.
Not sure, but what you have seen probably makes sense. The first time you run the disas command, printf has not yet been called, so it is not resolved. Once your program calls printf for the first time, the GOT is updated to point to the real function. Thus, the next disas command shows the real printf assembly.

Resources