Definition of GCT (Total garbage collection time) - garbage-collection

The "Total garbage collection time" can be determined by observing the GCT column printed by the command:
jstat -gc <pid>
as described by the documentation here: https://docs.oracle.com/en/java/javase/12/tools/jstat.html
It appears to be an amount of time spent doing GC since the Java process started, measured in seconds.
Is that per core? So, if a quad-core CPU was fully utilized by a single JVM instance for 100 seconds, and was garbage collecting 10% of the time, would GCT report 10 or 40? If I have hyperthreading enabled (i.e. 8 OS cores), then how should I reason about the GCT figure?
I'm using the OpenJDK12 HotSpot JVM.

The GCT metric is collector-specific. Each GC algorithm may define its own meaning for GCT.
Usually it is the total wall-clock time spent in stop-the-world GC pauses. In particular, for G1, the default collector in OpenJDK 12, GCT is the sum of Full GC pauses + stop-the-world phases of concurrent GC cycles.
However, for Shenandoah GC, GCT also includes the concurrent GC time. In either case, GCT is measured with an absolute (wall-clock) timer, irrespective of the number of threads or CPU cores. So in the example from the question, if the application was paused for GC during 10% of those 100 seconds, GCT would report roughly 10, not 40, with or without hyperthreading.
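As a rough programmatic counterpart to the jstat column, the standard java.lang.management API exposes an accumulated collection time per collector. The sketch below sums those values and compares them with JVM uptime. The class name GcTimeProbe is made up for illustration, and the result is not guaranteed to match jstat's GCT exactly, since which phases each bean counts depends on the collector.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcTimeProbe {
    /** Sums the accumulated collection time (ms) over all collector beans. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // approximate elapsed ms, or -1 if undefined
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime(); // wall-clock ms since JVM start
        long gcMs = totalGcMillis();
        // Both figures are elapsed (wall-clock) times, so the ratio is not multiplied by the core count.
        System.out.printf("GC time %d ms of %d ms uptime (%.2f%%)%n",
                gcMs, uptimeMs, uptimeMs == 0 ? 0.0 : 100.0 * gcMs / uptimeMs);
    }
}
```

Because both numbers are elapsed times, the printed percentage stays comparable across machines with different core counts.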

Related

Threads vs cores when threads are asleep

I am looking to confirm my assumptions about threads and CPU cores.
All the threads are the same. No disk I/O is used, threads do not share memory, and each thread does CPU bound work only.
If I have CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
A thread that is asleep only costs memory, not CPU time, for as long as it stays asleep. For example, 10,000 threads that are all asleep use the same amount of CPU as 1 sleeping thread.
In general, if you have a set of threads that sleep frequently while working on a parallel process, you can add more threads than there are cores until you reach a state where all the cores are busy 100% of the time.
Are any of my assumptions incorrect? If so, why?
Edit
When I say the thread is asleep, I mean that the thread is blocked for a specific amount of time. In C++ I would use sleep_for, which is documented as: "Blocks the execution of the current thread for at least the specified sleep_duration."
If we assume that you are talking about threads that are implemented using native thread support in a modern OS, then your statements are more or less correct.
There are a few factors that could cause the behavior to deviate from the "ideal".
If there are other user-space processes, they may compete for resources (CPU, memory, etcetera) with your application. That will reduce (for example) the CPU available to your application. Note that this will include things like the user-space processes responsible for running your desktop environment etc.
There are various overheads that will be incurred by the operating system kernel. There are many places where this happens including:
Managing the file system.
Managing physical / virtual memory system.
Dealing with network traffic.
Scheduling processes and threads.
That will reduce the CPU available to your application.
The thread scheduler typically doesn't do entirely fair scheduling. So one thread may get a larger percentage of the CPU than another.
There are some complicated interactions with the hardware when the application has a large memory footprint, and threads don't have good memory locality. For various reasons, memory intensive threads compete with each other and can slow each other down. These interactions are all accounted as "user process" time, but they result in threads being able to do less actual work.
So:
1) If I have CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
Probably not all of the time, due to other user processes and OS overheads.
2) If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
Approximately. There are the overheads (see above). There is also the issue that time slicing between different threads of the same priority is fairly coarse grained, and not necessarily fair.
3) If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
Approximately: see above.
4) A thread that is asleep only costs memory, and not CPU time, while it is still asleep. For example, 10,000 threads that are all asleep use the same amount of CPU as 1 thread asleep.
There is also the issue that the OS consumes CPU to manage the sleeping threads; e.g. putting them to sleep, deciding when to wake them, rescheduling.
Another one is that the memory used by the threads may also come at a cost. For instance if the sum of the memory used for all process (including all of the 10,000 threads' stacks) is larger than the available physical RAM, then there is likely to be paging. And that also uses CPU resources.
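To see the difference between a sleeping thread and a busy one, here is a minimal Java sketch that measures the CPU time the current thread consumes while sleeping and while spinning for the same wall-clock duration. The class and method names are illustrative only, and it assumes the JVM supports per-thread CPU time measurement (HotSpot normally does).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class SleepCpuCost {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** CPU nanoseconds consumed by the current thread while running the task. */
    static long cpuNanosFor(Runnable task) {
        long before = THREADS.getCurrentThreadCpuTime();
        task.run();
        return THREADS.getCurrentThreadCpuTime() - before;
    }

    /** Blocks the current thread; the OS deschedules it for the duration. */
    static void sleepMillis(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Keeps the current thread on a core by polling the clock. */
    static void spinMillis(long ms) {
        long end = System.nanoTime() + ms * 1_000_000L;
        while (System.nanoTime() < end) {
            // busy-wait
        }
    }

    public static void main(String[] args) {
        if (!THREADS.isCurrentThreadCpuTimeSupported()) {
            System.out.println("Per-thread CPU time is not supported on this JVM.");
            return;
        }
        long asleep = cpuNanosFor(() -> sleepMillis(200));
        long busy = cpuNanosFor(() -> spinMillis(200));
        System.out.printf("200 ms asleep: %d us of CPU%n", asleep / 1_000);
        System.out.printf("200 ms busy:   %d us of CPU%n", busy / 1_000);
    }
}
```

The sleeping case should report a small but non-zero figure, which is the cost of entering and leaving the sleep. The kernel-side bookkeeping mentioned above is not charged to the thread and so does not show up here.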
5) In general, if you have a set of threads that sleep frequently while working on a parallel process, you can add more threads than there are cores until you reach a state where all the cores are busy 100% of the time.
Not necessarily. If the virtual memory usage is out of whack (i.e. you are paging heavily), the system may have to idle some of the CPU while waiting for memory pages to be read from and written to the paging device. In short, you need to take account of memory utilization, or it will impact on the CPU utilization.
This also doesn't take account of thread scheduling and context switching between threads. Each time the OS switches a core from one thread to another it has to:
Save the old thread's registers.
Flush the processor's memory cache.
Invalidate the VM mapping registers, etcetera. This includes the TLBs that @bazza mentioned.
Load the new thread's registers.
Take performance hits due to having to do more main memory reads, and vm page translations because of previous cache invalidations.
These overheads can be significant. According to https://unix.stackexchange.com/questions/506564/ this is typically around 1.2 microseconds per context switch. That may not sound much, but if your application is switching threads rapidly, that could amount to many milliseconds in each second.
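The arithmetic behind "many milliseconds in each second" can be sketched as follows. The 1.2 microsecond figure is the one quoted above; the switch rates are made-up inputs for illustration, and the calculation covers only the direct switch cost, not the follow-on cache and TLB misses.

```java
class ContextSwitchCost {
    /** Assumed direct cost of one context switch, in seconds (1.2 microseconds). */
    static final double SECONDS_PER_SWITCH = 1.2e-6;

    /** Milliseconds of CPU spent on switching per second of wall time, for a given switch rate. */
    static double overheadMillisPerSecond(double switchesPerSecond) {
        return switchesPerSecond * SECONDS_PER_SWITCH * 1000.0;
    }

    public static void main(String[] args) {
        // Hypothetical switch rates; measure your own with OS tooling.
        for (double rate : new double[] {1_000, 10_000, 100_000}) {
            System.out.printf("%,.0f switches/s -> %.1f ms of switching per second%n",
                    rate, overheadMillisPerSecond(rate));
        }
    }
}
```

At 10,000 switches per second this comes to about 12 ms out of every second, or roughly 1.2% of one core.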
As already mentioned in the comments, it depends on a number of factors. But in a general sense your assumptions are correct.
Sleep
In the bad old days a sleep() might have been implemented by the C library as a loop doing pointless work (e.g. multiplying 1 by 1 until the required time had elapsed). In that case, the CPU would still be 100% busy. Platforms such as MS-DOS worked this way, but any multitasking OS has had a proper implementation for decades: nowadays a sleep() actually results in the thread being descheduled for the requisite time.
10,000 sleeping threads will take up more CPU time, because the OS has to make scheduling judgements every timeslice tick (every 60ms, or thereabouts). The more threads it has to check for being ready to run, the more CPU time that checking takes.
Translation Lookaside Buffers
Adding more threads than cores is generally seen as OK. But you can run into a problem with Translation Lookaside Buffers (or their equivalents on other CPUs). These are part of the virtual memory management side of the CPU, and they are effectively content-addressable memory. This is really hard to implement, so there's never that much of it. Thus the more memory allocations there are (which there will be if you add more and more threads), the more this resource is eaten up, to the point where the OS may have to start swapping different loadings of the TLB in and out in order for all the virtual memory allocations to be accessible. If this starts happening, everything in the process becomes really, really slow. This is likely less of a problem these days than it was, say, 20 years ago.
Also, modern memory allocators in C libraries (and thence everything else built on top, e.g. Java, C#, the lot) will actually be quite careful in how requests for virtual memory are managed, minimising the times they actually have to ask the OS for more virtual memory. Basically they seek to provide requested allocations out of pools they've already got, rather than each malloc() resulting in a call to the OS. This takes the pressure off the TLBs.

What exactly is cumulative CPU time

While working with a Tomcat process on Linux, we observed that the time field shows
5506:34 (cumulative CPU time). From what we found while exploring, this is the amount of CPU time the process has spent running over its entire lifetime.
Since this is a Java process, we also observed that memory was almost full and needed a restart.
My question is: what exactly is this cumulative CPU time? Why is this specific process taking more CPU time when there are other processes too?
It is the total time the CPU has spent on a process. If the process uses multiple threads, their CPU times are added together.
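A minimal Java sketch of that accumulation, assuming Java 9 or later and a platform that reports process CPU time through ProcessHandle. The class name and the choice of four busy threads are illustrative. On a machine with at least four free cores, the CPU time consumed should come out at several times the elapsed time.

```java
import java.time.Duration;
import java.util.Optional;

class CumulativeCpuTime {
    /** Total CPU time of this process so far, if the platform reports it. */
    static Optional<Duration> processCpuTime() {
        return ProcessHandle.current().info().totalCpuDuration();
    }

    /** Keeps the calling thread busy for roughly the given wall-clock time. */
    static void spinMillis(long ms) {
        long end = System.nanoTime() + ms * 1_000_000L;
        while (System.nanoTime() < end) {
            // busy-wait
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Optional<Duration> before = processCpuTime();
        long wallStart = System.nanoTime();

        Thread[] workers = new Thread[4]; // illustrative thread count
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> spinMillis(500));
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }

        long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
        Optional<Duration> after = processCpuTime();
        if (before.isPresent() && after.isPresent()) {
            long cpuMs = after.get().minus(before.get()).toMillis();
            System.out.printf("elapsed: %d ms, CPU consumed by all threads: %d ms%n", wallMs, cpuMs);
        } else {
            System.out.println("Process CPU time is not reported on this platform.");
        }
    }
}
```

This is also why a long-running, multi-threaded server process such as Tomcat can accumulate far more CPU time than shorter-lived or mostly idle processes on the same machine.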

Big difference between Elapsed Time and CPU Time

VTune hotspots analysis reports that my program's execution time (elapsed time) was 60 seconds, of which only 10 seconds are reported as "CPU Time". I'm trying to work out where the remaining 50 seconds were spent. Using Windows Process Monitor's File System Activity, I see my program spent 5 seconds doing disk I/O. This still leaves 45 seconds unaccounted for.
There are two threads in my program. According to VTune, one of those threads consumed 99% of the CPU Time. I don't see how these two threads, given their execution profiles, could explain the lost time.
Any thoughts?

Java 7 G1GC strange behaviour

Recently I tried to use G1GC from jdk1.7.0-17 in my Java processor, which processes a lot of similar messages received from an MQ (about 15-20 req/sec). Every message is processed in a separate thread (about 100 threads in the stable state) serviced by a bounded Java thread pool. Surprisingly, I detected strange behaviour: as soon as the GC starts a full GC cycle, it begins to use significant processing time (up to 100% CPU and even more). I refactored the code several times with the goal of optimizing it and making it more lightweight, but without any significant result; the behaviour is the same. I use a 4-core 64-bit machine with Debian (2.6.32-5 kernel). Can someone help me understand and resolve the situation?
Below are some illustrations of the issue described above.
Surprisingly, I detected the strange behaviour - as soon as GC starts
the full gc cycle...
Unfortunately, this is not a surprise: the G1 GC implemented in this JVM uses just one hardware thread (vCPU) to execute a Full GC, so the idea is to minimize the number of Full GCs. Keep in mind that this collector is recommended for configurations with several cores (this does not help the Full GC, but it does affect allocation and the parallel collections) and big heaps, I think bigger than 8GB.
According to Oracle:
https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc.html
The Garbage-First (G1) garbage collector is a server-style garbage
collector, targeted for multiprocessor machines with large memories.
It attempts to meet garbage collection (GC) pause time goals with high
probability while achieving high throughput. Whole-heap operations,
such as global marking, are performed concurrently with the
application threads. This prevents interruptions proportional to heap
or live-data size.
This article explains the single-threaded Full GC in this collector:
https://www.redhat.com/en/blog/part-1-introduction-g1-garbage-collector
Finally and unfortunately, G1 also has to deal with the dreaded Full
GC. While G1 is ultimately trying to avoid Full GC’s, they are still a
harsh reality especially in improperly tuned environments. Given that
G1 is targeting larger heap sizes, the impact of a Full GC can be
catastrophic to in-flight processing and SLAs. One of the primary
reasons is that Full GCs are still a single-threaded operation in G1.
Looking at causes, the first, and most avoidable, is related to
Metaspace.
By the way, it seems that the newest version of Java (10) is going to include a G1 capable of executing Full GCs in parallel.
https://www.opsian.com/blog/java-10-with-g1/
Java 10 reduces Full GC pause times by iteratively improving on its
existing algorithm. Until Java 10 G1 Full GCs ran in a single thread.
That’s right - your 32 core server and it’s 128GB will stop and pause
until a single thread takes out the garbage.
Perhaps you should tune the metaspace, increase the heap, or use another collector such as the parallel GC.

What kind of Garbage Collection does Go use?

Go is a garbage collected language:
http://golang.org/doc/go_faq.html#garbage_collection
Here it says that it's a mark-and-sweep garbage collector, but it doesn't delve into details, and a replacement is in the works... yet this paragraph seems not to have been updated much since Go was released.
Is it still mark-and-sweep? Is it conservative or precise? Is it generational?
Plans for Go 1.4+ garbage collector:
hybrid stop-the-world/concurrent collector
stop-the-world part limited by a 10ms deadline
CPU cores dedicated to running the concurrent collector
tri-color mark-and-sweep algorithm
non-generational
non-compacting
fully precise
incurs a small cost if the program is moving pointers around
lower latency, but most likely also lower throughput, than Go 1.3 GC
Go 1.3 garbage collector updates on top of Go 1.1:
concurrent sweep (results in smaller pause times)
fully precise
Go 1.1 garbage collector:
mark-and-sweep (parallel implementation)
non-generational
non-compacting
mostly precise (except stack frames)
stop-the-world
bitmap-based representation
zero-cost when the program is not allocating memory (that is: shuffling pointers around is as fast as in C, although in practice this runs somewhat slower than C because the Go compiler is not as advanced as C compilers such as GCC)
supports finalizers on objects
there is no support for weak references
Go 1.0 garbage collector:
same as Go 1.1, but instead of being mostly precise the garbage collector is conservative. The conservative GC is able to ignore objects such as []byte.
Replacing the GC with a different one is controversial, for example:
except for very large heaps, it is unclear whether a generational GC would be faster overall
package "unsafe" makes it hard to implement fully precise GC and compacting GC
(For Go 1.8 - Q1 2017, see below)
The next Go 1.5 concurrent garbage collector involves being able to "pace" said GC.
Here is a proposal, presented in this paper, which might make it into Go 1.5, but which also helps in understanding the GC in Go.
You can see the state before 1.5 (Stop The World: STW)
Prior to Go 1.5, Go has used a parallel stop-the-world (STW) collector.
While STW collection has many downsides, it does at least have predictable and controllable heap growth behavior.
(Photo from GopherCon 2015 presentation "Go GC: Solving the Latency Problem in Go 1.5")
The sole tuning knob for the STW collector was “GOGC”, the relative heap growth between collections. The default setting, 100%, triggered garbage collection every time the heap size doubled over the live heap size as of the previous collection:
GC timing in the STW collector.
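The GOGC rule described above can be written as a small formula. The symbols H_live and H_trigger are introduced here for illustration and do not come from the quoted text: H_live is the live heap after the previous collection, and H_trigger is the heap size at which the next collection starts.

```latex
H_{\text{trigger}} = H_{\text{live}} \left( 1 + \frac{\text{GOGC}}{100} \right)
% Example with the default GOGC = 100 and a live heap of 40 MB:
%   H_trigger = 40 MB * (1 + 100/100) = 80 MB   (the heap has doubled)
% Example with GOGC = 50 and the same live heap:
%   H_trigger = 40 MB * (1 + 50/100) = 60 MB
```

A lower GOGC therefore trades more frequent collections for a smaller peak heap.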
Go 1.5 introduces a concurrent collector.
This has many advantages over STW collection, but it makes heap growth harder to control because the application can allocate memory while the garbage collector is running.
(Photo from GopherCon 2015 presentation "Go GC: Solving the Latency Problem in Go 1.5")
To achieve the same heap growth limit the runtime must start garbage collection earlier, but how much earlier depends on many variables, many of which cannot be predicted.
Start the collector too early, and the application will perform too many garbage collections, wasting CPU resources.
Start the collector too late, and the application will exceed the desired maximum heap growth.
Achieving the right balance without sacrificing concurrency requires carefully pacing the garbage collector.
GC pacing aims to optimize along two dimensions: heap growth, and CPU utilized by the garbage collector.
The design of GC pacing consists of four components:
an estimator for the amount of scanning work a GC cycle will require,
a mechanism for mutators to perform the estimated amount of scanning work by the time heap allocation reaches the heap goal,
a scheduler for background scanning when mutator assists underutilize the CPU budget, and
a proportional controller for the GC trigger.
The design balances two different views of time: CPU time and heap time.
CPU time is like standard wall clock time, but passes GOMAXPROCS times faster.
That is, if GOMAXPROCS is 8, then eight CPU seconds pass every wall second and GC gets two seconds of CPU time every wall second.
The CPU scheduler manages CPU time.
The passage of heap time is measured in bytes and moves forward as mutators allocate.
The relationship between heap time and wall time depends on the allocation rate and can change constantly.
Mutator assists manage the passage of heap time, ensuring the estimated scan work has been completed by the time the heap reaches the goal size.
Finally, the trigger controller creates a feedback loop that ties these two views of time together, optimizing for both heap time and CPU time goals.
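The CPU-time figures in the passage above can be reproduced with a short calculation. The 25% GC share used here is an assumption inferred from the quoted numbers (two of eight CPU seconds); the symbols are introduced for illustration only.

```latex
T_{\text{cpu}} = \text{GOMAXPROCS} \cdot T_{\text{wall}},
\qquad
T_{\text{gc}} = u_{\text{gc}} \cdot T_{\text{cpu}}
% With GOMAXPROCS = 8, T_wall = 1 s and an assumed GC share u_gc = 0.25:
%   T_cpu = 8 * 1 s    = 8 CPU seconds per wall second
%   T_gc  = 0.25 * 8 s = 2 CPU seconds per wall second
```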
This is the implementation of the GC:
https://github.com/golang/go/blob/master/src/runtime/mgc.go
From the docs in the source:
The GC runs concurrently with mutator threads, is type accurate (aka precise), allows multiple GC thread to run in parallel. It is a concurrent mark and sweep that uses a write barrier. It is non-generational and non-compacting. Allocation is done using size segregated per P allocation areas to minimize fragmentation while eliminating locks in the common case.
Go 1.8 GC might evolve again, with the proposal "Eliminate STW stack re-scanning"
As of Go 1.7, the one remaining source of unbounded and potentially non-trivial stop-the-world (STW) time is stack re-scanning.
We propose to eliminate the need for stack re-scanning by switching to a hybrid write barrier that combines a Yuasa-style deletion write barrier [Yuasa '90] and a Dijkstra-style insertion write barrier [Dijkstra '78].
Preliminary experiments show that this can reduce worst-case STW time to under 50µs, and this approach may make it practical to eliminate STW mark termination altogether.
The announcement is here and you can see the relevant source commit is d70b0fe and earlier.
I'm not sure, but I think the current (tip) GC is already a parallel one, or at least it's a WIP. Thus the stop-the-world property doesn't apply any more, or will not in the near future. Perhaps someone else can clarify this in more detail.
