Garbage Collector and concurrent marking in V8 - node.js

I'm reading about the V8 GC here. As this new GC uses worker threads to perform concurrent marking, I wonder whether overall performance is better when there is more than one CPU. Will the GC run faster? Has anyone compared both scenarios?
My app is not clustered.

Yes, you will only get a speed benefit from concurrent operations (in V8 or elsewhere) if you have more than one CPU core.
The actual performance impact depends on the specifics of your app, so you'll have to measure it yourself if you want results that actually apply to your case. As a rough guess, I would expect "a couple percent" of overall throughput difference: most of JavaScript is single-threaded, and in most apps garbage collection accounts for about 2-10% of CPU load.
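If you do want to measure it, a quick way to see how much time V8 spends collecting is to enable GC tracing (the flag is passed straight through to V8); app.js here is just a placeholder for your entry point:

    node --trace-gc app.js

Each logged line reports a scavenge or mark-sweep pause and the heap size before and after it, which you can add up and compare against total run time.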

Related

Is there ever a reason to use thread affinity when there are more threads being used than cores specified/reserved?

I am working with Rust but this question would also apply to many other situations.
Suppose you have M available vCPUs and N threads (including the main thread) to schedule, and that N > M. Each thread does approximately equal amounts of work.
Is there any good reason then to pin threads to specific cores? I've written a number of toy benchmarks and it seems like the answer is no, as I cannot make a program under these assumptions that performs better with thread affinity; in fact, it always does much worse.
If your application runs on a system with many cores and relies heavily on the core caches, context switches become expensive, so pinning tasks to cores reduces context switches and improves throughput.
But on an "average PC" running plain RAM-bound tasks, your OS scheduler will be much better at load-balancing the cores than you ever will.
Pinning threads to cores is also useful if you care about latency rather than throughput. On a heavily loaded system, if you have a time-critical task you want it to have its own core that won't be contended by other tasks, so it makes sense to pin it to a specific core; an example would be an in-memory database that needs to respond to requests with sub-millisecond latency.
So the answer is: it's only useful for certain applications.
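For illustration, here is a minimal sketch of explicit pinning on Linux, written in Go rather than Rust (the underlying mechanism, sched_setaffinity, is the same); it assumes the golang.org/x/sys/unix package and makes no claim about when pinning is actually worth doing:

    // pin.go: pin the calling goroutine's OS thread to one core (Linux only).
    package main

    import (
        "fmt"
        "runtime"

        "golang.org/x/sys/unix"
    )

    func pinToCore(core int) error {
        // Keep this goroutine on its current OS thread so the affinity sticks.
        runtime.LockOSThread()

        var set unix.CPUSet
        set.Zero()
        set.Set(core)
        // pid 0 means "the calling thread".
        return unix.SchedSetaffinity(0, &set)
    }

    func main() {
        if err := pinToCore(0); err != nil {
            fmt.Println("pinning failed:", err)
            return
        }
        fmt.Println("worker pinned to core 0")
        // ... do the cache-sensitive or latency-critical work here ...
    }

Whether this beats the OS scheduler is exactly the trade-off described above: measure it for your workload.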

How to monitor contention from competing hyperthreads

Is there some performance counter or similar available which can help quantify the extent to which two hyperthreads on the same core are competing with each other for execution resources?
Use case: in a shared infrastructure environment like Kubernetes, I may be able to request that my process has "1 CPU" of capacity. However with hyperthreading in play this is not a physical core but a logical CPU, and the throughput of that core may vary by as much as a factor of 2 depending on what's scheduled to run on the other logical CPU corresponding to the other hyperthread of the same physical core. Is there some metric which can quantify how much time has been spent stalled waiting on the other hyperthread?
Another use case: depending on the nature of the workload and how much it stands to gain from hyperthreading, average CPU utilization may not be a good indication of available capacity. For a workload that gets no benefit from hyperthreading at all, loading the host beyond 50% CPU utilization may not increase throughput at all. What metrics could I use to determine how much real execution resource, and not just idle logical CPU time, is available?
Probably, since everything is in source code, you could implement such a counter yourself...
Of course, the implementation depends on what you want to monitor:
a counter per shared resource,
a global counter per function,
a counter per operation type (lock/unlock).
In any case, a counter (in/de)crement must be atomic in essence, and may itself need protection (a spinlock?), making your code slower, which is one of several reasons such counters are not included by default.
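As a rough illustration of the "counter per shared resource" idea (this only counts software lock contention, it does not measure hyperthread competition in hardware), a sketch in Go with made-up names:

    // countingMutex counts how often a caller found the lock already held.
    package main

    import (
        "fmt"
        "sync"
        "sync/atomic"
    )

    type countingMutex struct {
        mu         sync.Mutex
        contention atomic.Int64
    }

    func (c *countingMutex) Lock() {
        // TryLock (Go 1.18+) tells us whether we would have had to wait.
        if !c.mu.TryLock() {
            c.contention.Add(1) // the atomic increment itself has a cost, as noted above
            c.mu.Lock()
        }
    }

    func (c *countingMutex) Unlock() { c.mu.Unlock() }

    func main() {
        var cm countingMutex
        var wg sync.WaitGroup
        for i := 0; i < 8; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := 0; j < 1000; j++ {
                    cm.Lock()
                    cm.Unlock()
                }
            }()
        }
        wg.Wait()
        fmt.Println("contended acquisitions:", cm.contention.Load())
    }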

Java 7 G1GC strange behaviour

Recently I tried to use G1GC from jdk1.7.0-17 in my Java processor, which handles a lot of similar messages received from an MQ (about 15-20 req/sec). Every message is processed in a separate thread (about 100 threads in a steady state) serviced by a bounded Java thread pool. Surprisingly, I observed strange behaviour: as soon as the GC starts a full GC cycle, it begins to use significant processing time (up to 100% CPU and even more). I refactored the code several times with the goal of optimizing it and making it more lightweight, but without any significant result; the behaviour stays the same. I use a 4-core 64-bit machine with Debian (2.6.32-5 kernel). Can someone help me understand and resolve the situation?
Surprisingly, I detected the strange behaviour - as soon as GC starts
the full gc cycle...
Unfortunately, this is not a surprise, because the G1 GC implemented in this JVM uses just one hardware thread (vCPU) to execute the Full GC, so the idea is to minimize the number of Full GCs. Please keep in mind that this collector is recommended for configurations with several cores (which of course does not affect the Full GC, but does affect allocation and parallel collections) and big heaps, I think bigger than 8GB.
According to Oracle:
https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc.html
The Garbage-First (G1) garbage collector is a server-style garbage
collector, targeted for multiprocessor machines with large memories.
It attempts to meet garbage collection (GC) pause time goals with high
probability while achieving high throughput. Whole-heap operations,
such as global marking, are performed concurrently with the
application threads. This prevents interruptions proportional to heap
or live-data size.
This article explains the single-threaded Full GC in this collector.
https://www.redhat.com/en/blog/part-1-introduction-g1-garbage-collector
Finally and unfortunately, G1 also has to deal with the dreaded Full
GC. While G1 is ultimately trying to avoid Full GC’s, they are still a
harsh reality especially in improperly tuned environments. Given that
G1 is targeting larger heap sizes, the impact of a Full GC can be
catastrophic to in-flight processing and SLAs. One of the primary
reasons is that Full GCs are still a single-threaded operation in G1.
Looking at causes, the first, and most avoidable, is related to
Metaspace.
By the way, it seems that the newest version of Java (10) is going to include a G1 capable of executing Full GCs in parallel.
https://www.opsian.com/blog/java-10-with-g1/
Java 10 reduces Full GC pause times by iteratively improving on its
existing algorithm. Until Java 10 G1 Full GCs ran in a single thread.
That’s right - your 32 core server and it’s 128GB will stop and pause
until a single thread takes out the garbage.
Perhaps you should tune the Metaspace, increase the heap, or use another collector such as the Parallel GC.
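As a hedged starting point only (standard HotSpot flags; the heap sizes are placeholders and your-processor.jar is a made-up name), you could compare the two collectors with GC logging enabled:

    java -XX:+UseG1GC -Xms4g -Xmx4g -XX:MaxGCPauseMillis=200 \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         -jar your-processor.jar

    java -XX:+UseParallelGC -Xms4g -Xmx4g \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         -jar your-processor.jar

The GC log will show what triggers each Full GC, which is what you want to know before tuning further.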

CPU usage vs Number of threads

In general, what is the relation between CPU usage and the number of threads in a program?
Assumptions:
Multi-core CPU
Threads do the exact same job (assume they fetch identical work items from a queue and process them)
It depends on the nature of the application.
An application that mostly does calculations: a ratio of 1 thread per core is a reasonable choice, since you don't want to spawn too many threads because of the overhead, and you want to take advantage of all your cores.
An application that mostly does IO operations (like HTTP requests) can spawn many more threads than the number of cores and still increase efficiency, since the bottleneck is the waiting time per IO request, and you want to get as much done as possible during each wait.
That said, the CPU usage you are going to get still depends on many factors (IO, synchronization, non-parallel parts of your program).
If you are interested in how fast the application will run, always remember Amdahl's law, which gives a strict bound on the speed-up your application can achieve, even with an infinite number of cores.
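For reference, Amdahl's law for a parallelizable fraction P of the work running on N cores:

    speed-up(N) = 1 / ((1 - P) + P / N)

For example, with P = 0.9 the speed-up can never exceed 10x, no matter how many cores or threads you add.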
There is no such general relationship, except for the obvious ones:
an application can't use more CPU time (CPU seconds) than the number of available cores multiplied by the number of (wall clock) seconds that it runs, and
a single thread can't use more than one CPU second per second.
The actual amount of CPU that a multi-threaded application uses depends mostly on the nature of the application and the way you've implemented it:
If the computation performed by each thread does not generate contention with other threads for locks, memory access and so on, then you should be able to approach the theoretical limit of available CPU resources.
Contention is liable to reduce effective CPU usage, sometimes dramatically.
But there are no general formulae that will tell you how much speed-up you can get.
I think there is no relation, or at least not a simple one. It depends on the jobs the threads are doing. A program with one thread can consume 100% of the CPU, and a program with lots of threads can consume less.
If you are looking for an optimal relation between threads and work done, you must study your own case, and you will probably end up with an empirical solution.
As the other answers already state, "it depends". In an ideal world, for n cores, you would get a throughput of factor n, given that you do the same job in a separate thread on each core (which already contains a false assumption, since you need to somehow synchronize the threads when they read from the same queue).
Understanding the Disruptor, a Beginner's Guide to Hardcore Concurrency gives some nice examples of what you need to consider when parallelizing tasks, and also shows some cases where the attempt to parallelize leads to longer execution time.

What kind of Garbage Collection does Go use?

Go is a garbage collected language:
http://golang.org/doc/go_faq.html#garbage_collection
Here it says that it's a mark-and-sweep garbage collector, but it doesn't delve into details, and a replacement is in the works... yet, this paragraph seems not to have been updated much since Go was released.
Is it still mark-and-sweep? Is it conservative or precise? Is it generational?
Plans for Go 1.4+ garbage collector:
hybrid stop-the-world/concurrent collector
stop-the-world part limited by a 10ms deadline
CPU cores dedicated to running the concurrent collector
tri-color mark-and-sweep algorithm
non-generational
non-compacting
fully precise
incurs a small cost if the program is moving pointers around
lower latency, but most likely also lower throughput, than Go 1.3 GC
Go 1.3 garbage collector updates on top of Go 1.1:
concurrent sweep (results in smaller pause times)
fully precise
Go 1.1 garbage collector:
mark-and-sweep (parallel implementation)
non-generational
non-compacting
mostly precise (except stack frames)
stop-the-world
bitmap-based representation
zero-cost when the program is not allocating memory (that is: shuffling pointers around is as fast as in C, although in practice this runs somewhat slower than C because the Go compiler is not as advanced as C compilers such as GCC)
supports finalizers on objects
there is no support for weak references
Go 1.0 garbage collector:
same as Go 1.1, but instead of being mostly precise the garbage collector is conservative. The conservative GC is able to ignore objects such as []byte.
Replacing the GC with a different one is controversial, for example:
except for very large heaps, it is unclear whether a generational GC would be faster overall
package "unsafe" makes it hard to implement fully precise GC and compacting GC
(For Go 1.8 - Q1 2017, see below)
The next Go 1.5 concurrent garbage collector involves being able to "pace" said GC.
Here is a proposal presented in this paper, which might make it into Go 1.5 but also helps in understanding the GC in Go.
You can see the state before 1.5 (stop-the-world: STW):
Prior to Go 1.5, Go has used a parallel stop-the-world (STW) collector.
While STW collection has many downsides, it does at least have predictable and controllable heap growth behavior.
(Photo from GopherCon 2015 presentation "Go GC: Solving the Latency Problem in Go 1.5")
The sole tuning knob for the STW collector was “GOGC”, the relative heap growth between collections. The default setting, 100%, triggered garbage collection every time the heap size doubled over the live heap size as of the previous collection:
GC timing in the STW collector.
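That knob survived the move to the concurrent collector; a small sketch of the two equivalent ways to set it (the value 200 is only an example):

    // Sketch: GOGC is the allowed heap growth, in percent, over the live heap
    // left by the previous collection. It can be set via the environment
    // (GOGC=200 go run main.go) or programmatically:
    package main

    import (
        "fmt"
        "os"
        "runtime/debug"
    )

    func main() {
        fmt.Println("GOGC from environment:", os.Getenv("GOGC"))

        // 200 means: start the next GC once the heap has grown 200%
        // (i.e. to 3x) over the live heap of the previous collection.
        // The previous setting is returned.
        old := debug.SetGCPercent(200)
        fmt.Println("previous GC percent:", old)
    }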
Go 1.5 introduces a concurrent collector.
This has many advantages over STW collection, but it makes heap growth harder to control because the application can allocate memory while the garbage collector is running.
(Photo from GopherCon 2015 presentation "Go GC: Solving the Latency Problem in Go 1.5")
To achieve the same heap growth limit the runtime must start garbage collection earlier, but how much earlier depends on many variables, many of which cannot be predicted.
Start the collector too early, and the application will perform too many garbage collections, wasting CPU resources.
Start the collector too late, and the application will exceed the desired maximum heap growth.
Achieving the right balance without sacrificing concurrency requires carefully pacing the garbage collector.
GC pacing aims to optimize along two dimensions: heap growth, and CPU utilized by the garbage collector.
The design of GC pacing consists of four components:
an estimator for the amount of scanning work a GC cycle will require,
a mechanism for mutators to perform the estimated amount of scanning work by the time heap allocation reaches the heap goal,
a scheduler for background scanning when mutator assists underutilize the CPU budget, and
a proportional controller for the GC trigger.
The design balances two different views of time: CPU time and heap time.
CPU time is like standard wall clock time, but passes GOMAXPROCS times faster.
That is, if GOMAXPROCS is 8, then eight CPU seconds pass every wall second, and the GC, whose CPU budget is 25% of GOMAXPROCS, gets two seconds of CPU time every wall second.
The CPU scheduler manages CPU time.
The passage of heap time is measured in bytes and moves forward as mutators allocate.
The relationship between heap time and wall time depends on the allocation rate and can change constantly.
Mutator assists manage the passage of heap time, ensuring the estimated scan work has been completed by the time the heap reaches the goal size.
Finally, the trigger controller creates a feedback loop that ties these two views of time together, optimizing for both heap time and CPU time goals.
This is the implementation of the GC:
https://github.com/golang/go/blob/master/src/runtime/mgc.go
From the docs in the source:
The GC runs concurrently with mutator threads, is type accurate (aka precise), allows multiple GC thread to run in parallel. It is a concurrent mark and sweep that uses a write barrier. It is non-generational and non-compacting. Allocation is done using size segregated per P allocation areas to minimize fragmentation while eliminating locks in the common case.
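Not from the runtime documentation quoted above, but a small sketch of how you can observe those collections from a program (GODEBUG=gctrace=1 on the command line gives similar per-cycle information):

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        // Allocate something so at least one collection has work to do.
        data := make([][]byte, 0, 1024)
        for i := 0; i < 1024; i++ {
            data = append(data, make([]byte, 1<<20)) // 1 MiB each
        }
        runtime.GC() // force a cycle for the demo

        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Println("completed GC cycles:", m.NumGC)
        fmt.Println("total pause (ns):", m.PauseTotalNs)
        fmt.Println("live heap (bytes):", m.HeapAlloc)

        _ = data // keep the slices reachable until after the measurement
    }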
Go 1.8 GC might evolve again, with the proposal "Eliminate STW stack re-scanning"
As of Go 1.7, the one remaining source of unbounded and potentially non-trivial stop-the-world (STW) time is stack re-scanning.
We propose to eliminate the need for stack re-scanning by switching to a hybrid write barrier that combines a Yuasa-style deletion write barrier [Yuasa '90] and a Dijkstra-style insertion write barrier [Dijkstra '78].
Preliminary experiments show that this can reduce worst-case STW time to under 50µs, and this approach may make it practical to eliminate STW mark termination altogether.
The announcement is here and you can see the relevant source commit is d70b0fe and earlier.
I'm not sure, but I think the current (tip) GC is already a parallel one, or at least it's a work in progress. Thus, the stop-the-world property no longer applies, or will not in the near future. Perhaps someone else can clarify this in more detail.
