JVM didn't run Full GC, so why did G1 Old Gen memory decrease? - garbage-collection

Over the last few days I noticed that my service's memory usage rises a little every day, as in this picture:
Then I opened the JVM monitoring page and found that no Full GC occurred during these days:
So, my questions: why hasn't the JVM run a Full GC in the last 5 days,
and why did the G1 Old Gen memory grow and then finally decrease?
Thanks a lot.
By the way, the command used to launch my jar was:
java -server \
  -XX:+UseContainerSupport -XX:InitialRAMPercentage=40.0 -XX:MinRAMPercentage=10.0 -XX:MaxRAMPercentage=85.0 \
  -XX:NewRatio=2 -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m -XX:MaxDirectMemorySize=512m \
  -Dlog4j2.formatMsgNoLookups=true -Dfile.encoding=utf-8 -Dserver.port=80 \
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/init/configure/app/dumps \
  -XX:+ExitOnOutOfMemoryError -XX:OnOutOfMemoryError=/init/configure/app/s3-client \
  -javaagent:/init/configure/app/jmx_javaagent-0.16.1.jar=9009:/init/configure/app/config.yaml \
  -Dspring.profiles.active=prod \
  -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=10 -XX:-ShrinkHeapInSteps \
  -jar /opt/main.jar
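One way to check what the collectors are actually doing, independent of the monitoring page, is to query the garbage collector MXBeans (you already attach a JMX agent). Under G1, mixed collections reclaim Old Gen regions without a Full GC, so an Old Gen drop with a Full GC count of zero is not unusual. A minimal sketch, assuming a HotSpot JVM running G1:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // Each collector bean reports how often it ran and its accumulated collection time.
        // With G1, the "G1 Young Generation" bean typically covers young and mixed collections
        // (mixed collections also clean Old Gen regions), while "G1 Old Generation" counts Full GCs.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}

If the "G1 Old Generation" count stays at 0 while Old Gen usage still drops, the space was most likely reclaimed by mixed collections after a concurrent marking cycle rather than by a Full GC.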

Related

Inconsistent Tomcat response time

My Tomcat (8.5.8, behind Apache HTTPD) sometimes (for < 1 percent of requests) shows a high waited time (results from Glowroot).
JVM Thread Stats
CPU time: 422.3 milliseconds
Blocked time: 0.0 milliseconds
Waited time: 26,576.0 milliseconds
Allocated memory: 73.3 MB
The provided stack trace shows Tomcat was waiting for a latch. At that time my Tomcat host memory, CPU, disk IO, garbage collection looks good.
I tried to issue the same request, and the same data (compressed size 2.5 MB, original size about 25 MB) was returned. Most of the time it was okay (< 2 seconds), but a few times it took long (> 20 seconds).
com.fasterxml.jackson.core.json.UTF8JsonGenerator.writeString(UTF8JsonGenerator.java:432)
com.fasterxml.jackson.core.json.UTF8JsonGenerator._writeStringSegments(UTF8JsonGenerator.java:1148)
com.fasterxml.jackson.core.json.UTF8JsonGenerator._flushBuffer(UTF8JsonGenerator.java:2003)
org.springframework.security.web.context.OnCommittedResponseWrapper$SaveContextServletOutputStream.write(OnCommittedResponseWrapper.java:540)
org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:96)
org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:369)
org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:391)
org.apache.catalina.connector.OutputBuffer.append(OutputBuffer.java:713)
org.apache.catalina.connector.OutputBuffer.flushByteBuffer(OutputBuffer.java:808)
org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:351)
org.apache.coyote.Response.doWrite(Response.java:517)
org.apache.coyote.ajp.AjpProcessor$SocketOutputBuffer.doWrite(AjpProcessor.java:1469)
org.apache.coyote.ajp.AjpProcessor.access$900(AjpProcessor.java:54)
org.apache.coyote.ajp.AjpProcessor.writeData(AjpProcessor.java:1353)
org.apache.tomcat.util.net.SocketWrapperBase.write(SocketWrapperBase.java:361)
org.apache.tomcat.util.net.SocketWrapperBase.writeBlocking(SocketWrapperBase.java:419)
org.apache.tomcat.util.net.SocketWrapperBase.doWrite(SocketWrapperBase.java:670)
org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite(NioEndpoint.java:1259)
org.apache.tomcat.util.net.NioSelectorPool.write(NioSelectorPool.java:157)
org.apache.tomcat.util.net.NioBlockingSelector.write(NioBlockingSelector.java:114)
org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.awaitWriteLatch(NioEndpoint.java:1109)
org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.awaitLatch(NioEndpoint.java:1106)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
sun.misc.Unsafe.park(Native Method)
TIMED_WAITING
What is the reason for the waiting? A connectivity issue with the client? A slow client? A connectivity issue between Apache HTTPD and Tomcat? Not enough file descriptors?
What is the latch here?
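For context on the bottom of that trace: Tomcat's blocking NIO write path parks the request thread on a java.util.concurrent.CountDownLatch (the "write latch") until its block poller signals that the socket is writable again, so the thread sits in TIMED_WAITING while the client, or the HTTPD-to-Tomcat connection, is too slow to drain the response. A minimal sketch of that latch pattern (an illustration only, not Tomcat's actual code):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class WriteLatchDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch writeLatch = new CountDownLatch(1);

        Thread writer = new Thread(() -> {
            try {
                // Parks here in TIMED_WAITING (LockSupport.parkNanos under the hood)
                // until the latch is released or the write timeout expires.
                boolean writable = writeLatch.await(20, TimeUnit.SECONDS);
                System.out.println(writable ? "socket writable again, resume writing"
                                            : "write timed out");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "http-writer");
        writer.start();

        Thread.sleep(2000);      // simulate a slow consumer draining the socket buffer
        writeLatch.countDown();  // the poller's "writable" signal releases the writer
        writer.join();
    }
}

A thread dump taken while the writer is parked shows the same CountDownLatch.await / parkNanos frames as in the Glowroot trace.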

PostgreSQL out of memory: Linux OOM killer

I am having issues with a large query, which I suspect are due to a wrong configuration in my postgresql.conf. My setup is PostgreSQL 9.6 on Ubuntu 17.10 with 32 GB RAM and a 3 TB HDD. The query runs pgr_dijkstraCost to create an OD matrix of ~10,000 points in a network of 25,000 links. The resulting table is therefore expected to be very big (~100,000,000 rows with columns from, to, costs). However, a simple test such as select x,1 as c2,2 as c3 from generate_series(1,90000000) succeeds.
The query plan:
QUERY PLAN
--------------------------------------------------------------------------------------
Function Scan on pgr_dijkstracost (cost=393.90..403.90 rows=1000 width=24)
InitPlan 1 (returns $0)
-> Aggregate (cost=196.82..196.83 rows=1 width=32)
-> Seq Scan on building_nodes b (cost=0.00..166.85 rows=11985 width=4)
InitPlan 2 (returns $1)
-> Aggregate (cost=196.82..196.83 rows=1 width=32)
-> Seq Scan on building_nodes b_1 (cost=0.00..166.85 rows=11985 width=4)
This leads to a crash of PostgreSQL:
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
normally and possibly corrupted shared memory.
Running dmesg, I could trace it down to an out-of-memory issue:
Out of memory: Kill process 5630 (postgres) score 949 or sacrifice child
[ 5322.821084] Killed process 5630 (postgres) total-vm:36365660kB,anon-rss:32344260kB, file-rss:0kB, shmem-rss:0kB
[ 5323.615761] oom_reaper: reaped process 5630 (postgres), now anon-rss:0kB,file-rss:0kB, shmem-rss:0kB
[11741.155949] postgres invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
[11741.155953] postgres cpuset=/ mems_allowed=0
When running the query I can also observe with top that my free RAM drops to 0 before the crash. The amount of committed memory just before the crash:
$ grep Commit /proc/meminfo
CommitLimit: 18574304 kB
Committed_AS: 42114856 kB
I would expect the HDD to be used to write/buffer temporary data when RAM is not enough, but the available space on my HDD does not change during processing. So I began digging for missing configuration settings (expecting issues due to my relocated data directory), following different sites:
https://www.postgresql.org/docs/current/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
https://www.credativ.com/credativ-blog/2010/03/postgresql-and-linux-memory-management
My original postgresql.conf settings are the defaults, except for the changed data directory:
data_directory = '/hdd_data/postgresql/9.6/main'
shared_buffers = 128MB # min 128kB
#huge_pages = try # on, off, or try
#temp_buffers = 8MB # min 800kB
#max_prepared_transactions = 0 # zero disables the feature
#work_mem = 4MB # min 64kB
#maintenance_work_mem = 64MB # min 1MB
#replacement_sort_tuples = 150000 # limits use of replacement selection sort
#autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem
#max_stack_depth = 2MB # min 100kB
dynamic_shared_memory_type = posix # the default is the first option
I changed the config:
shared_buffers = 128MB
work_mem = 40MB # min 64kB
maintenance_work_mem = 64MB
I reloaded the configuration with sudo service postgresql reload and tested the same query, but found no change in behavior. Does this simply mean that such a large query cannot be done? Any help is appreciated.
I'm having similar trouble, but not with PostgreSQL (which is running happily): what is happening is simply that the kernel cannot allocate more RAM to the process, whichever process it is.
It would certainly help to add some swap to your configuration.
To check how much RAM and swap you have, run: free -h
On my machine, here is what it returns:
total used free shared buff/cache available
Mem: 7.7Gi 5.3Gi 928Mi 865Mi 1.5Gi 1.3Gi
Swap: 9.4Gi 7.1Gi 2.2Gi
You can clearly see that my machine is quite overloaded: about 8 GB of RAM and 9 GB of swap, of which 7 GB are used.
When a RAM-hungry process got killed by the out-of-memory killer, I saw both RAM and swap at 100% usage.
So, allocating more swap may alleviate your problem.
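If you want to try that, a common way on Ubuntu is a swap file. A sketch only: the 16G size and the /swapfile path are examples, and you would add an /etc/fstab entry to make it permanent:

sudo fallocate -l 16G /swapfile   # reserve space for the swap file
sudo chmod 600 /swapfile          # restrict access to root
sudo mkswap /swapfile             # format it as swap
sudo swapon /swapfile             # enable it immediately
free -h                           # verify the new swap shows up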

ActiveMQ uses 100% CPU

I use ActiveMQ 5.9.0 with non-persistent messages and the settings below:
<policyEntry queue="foo.bar.>" memoryLimit="500mb" producerFlowControl="false">
<pendingQueuePolicy>
<fileQueueCursor />
</pendingQueuePolicy>
</policyEntry>
Heap size:
-Xms and -Xmx are both set to 1 GB.
My problem: when I send, for example, 300,000 messages into my queue, ActiveMQ initialises KahaDB and "Temp percent used" rises above 0. I leave it overnight, and the next day I send more messages (10,000, for example). I then notice that CPU usage climbs to 100%, memory to about 80-90%, and the ActiveMQ console freezes. This happens every time after waiting overnight. I looked for information on why this is happening but didn't find anything.
Does anyone know what is wrong?
If you use an OpenJDK version lower than 9, you should try to upgrade your OpenJDK.
The problem is explained here.
The OpenJDK issue is here.

Self-hosted WCF service maxing CPU load

I am looking into an issue at work with a Windows service that is taking 100% CPU on a machine with 16 CPUs.
The service is hosting a self-hosted .NET WCF service.
I have received a crash dump which I have loaded up in windbg, in order to look for clues.
So what I have tried:
!threads :
ThreadCount: 646
UnstartedThread: 0
BackgroundThread: 643
PendingThread: 0
DeadThread: 2
Hosted Runtime: no
642 of the threads were thread-pool workers, as follows:
8 29 2a34 000000002068b510 3029220 Preemptive 0000000000000000:0000000000000000 0000000000563f50 0 MTA (Threadpool Worker)
~29s -> !CLRStack
000000003c66eb70 00000000770512fa [GCFrame: 000000003c66eb70]
000000003c66ec40 00000000770512fa [GCFrame: 000000003c66ec40]
000000003c66ec78 00000000770512fa [HelperMethodFrame: 000000003c66ec78] System.Threading.Monitor.Enter(System.Object)
000000003c66ed70 000007fef7af1c9c System.Threading.TimerQueueTimer.Fire()
000000003c66ede0 000007fef7a6c2f3 System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
000000003c66ee30 000007fef7a6c92a System.Threading.ThreadPoolWorkQueue.Dispatch()
000000003c66f388 000007fef8d57d33 [DebuggerU2MCatchHandlerFrame: 000000003c66f388]
~29s -> K
000000003c66e858 000007fefd7010dc ntdll!NtWaitForSingleObject+0xa
000000003c66e860 000007fef8d049bf KERNELBASE!WaitForSingleObjectEx+0x79
000000003c66e900 000007fef8d04977 clr!CLREventBase::WaitEx+0x16c
000000003c66e940 000007fef8d048f8 clr!CLREventBase::WaitEx+0x103
000000003c66e9a0 000007fef8e9c5de clr!CLREventBase::WaitEx+0x70
000000003c66ea30 000007fef8dc5a34 clr!WKS::GCHeap::WaitUntilGCComplete+0x2b
000000003c66ea60 000007fef8d0c4f4 clr!Thread::RareDisablePreemptiveGC+0x176
000000003c66eaf0 000007fef8dd1f3d clr!GCCoop::GCCoop+0x3d
000000003c66eb20 000007fef8e898cf clr!AwareLock::Contention+0x137
000000003c66ebe0 000007fef7af1c9c clr!JITutil_MonContention+0xaf
000000003c66ed70 000007fef7a6c2f3 mscorlib_ni+0x521c9c
000000003c66ede0 000007fef7a6c92a mscorlib_ni+0x49c2f3
000000003c66ee30 000007fef8d57d33 mscorlib_ni+0x49c92a
000000003c66eef0 000007fef8d556e6 clr!CallDescrWorkerInternal+0x83
000000003c66ef30 000007fef8d557af clr!CallDescrWorkerWithHandler+0x4a
000000003c66ef70 000007fef8eda2c9 clr!MethodDescCallSite::CallTargetWorker+0x2e6
000000003c66f120 000007fef8ee51b0 clr!QueueUserWorkItemManagedCallback+0x2a
000000003c66f200 000007fef8ee513e clr!DebuggerU2MCatchHandlerFrame::DebuggerU2MCatchHandlerFrame+0xa0
000000003c66f240 000007fef8ee50b5 clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0x38e
000000003c66f340 000007fef8ee51eb clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0x2bd
000000003c66f3d0 000007fef8eda224 clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0x23b
000000003c66f430 000007fef8ee6baf clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0xb4
000000003c66f5c0 000007fef8ee6ab3 clr!ThreadpoolMgr::ExecuteWorkRequest+0x4c
000000003c66f5f0 000007fef8eda8a6 clr!ThreadpoolMgr::WorkerThreadStart+0xf3
000000003c66f6b0 0000000076c9652d clr!Thread::intermediateThreadProc+0x7d
000000003c66f7f0 000000007702c541 kernel32!BaseThreadInitThunk+0xd
000000003c66f820 0000000000000000 ntdll!RtlUserThreadStart+0x1d
I'm having a hard time interpreting the stack traces since they don't hit any of my application code.
Are they all just idle thread-pool workers, waiting for work?
Threads with WaitForSingleObject are not critical, since they are waiting and not consuming CPU time. But be aware that your dump is only a snapshot and you might have had bad luck when taking the snapshot.
For a performance analysis with WinDbg you'd need several dumps during high CPU and compare them. If they all have similar stack traces, that's fine and you can conclude something. If they are all very different, it's almost useless.
The command !runaway seems more interesting here, since it lists the CPU time consumed per thread, so you can identify the one(s) on high CPU. Again: having two snapshots that you can compare is helpful, because the main thread may still have consumed more CPU time in total than some short-lived 100% threads.
If you can't use a performance profiler, SysInternals Procdump can generate a series of dumps (-n) for you on high CPU (-c). Use -s to set the time between dumps. For .NET, don't forget -ma for full memory.
Other than that, 646 threads sounds like a lot to me. The OS itself could be quite busy just scheduling them.
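Coming back to the ProcDump suggestion above, an invocation along these lines (the process name is a placeholder for your service executable) would capture a series of three full-memory dumps triggered by high CPU:

procdump -ma -c 90 -s 5 -n 3 MyWcfService.exe

This combines the switches mentioned above: full memory dumps (-ma), a 90% CPU trigger (-c), a 5-second window (-s), and three dumps in the series (-n).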
Sounds like the issue could be related to GC. Since this is a self-hosted service, it will use the Workstation GC by default, unless you enable the server GC manually:
http://msdn.microsoft.com/en-us/library/ms229357(v=vs.110).aspx
Have you tried that to see if it makes any difference?
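For reference, server GC for a self-hosted (non-IIS) process is switched on in its app.config / MyService.exe.config (a minimal sketch; the rest of the configuration file is assumed to already exist):

<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>

On a 16-CPU box the difference between workstation and server GC can be significant, so it is worth a test.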
Perfview from Microsoft may be helpful. From the link:
http://blogs.msdn.com/b/dotnet/archive/2012/10/09/improving-your-app-s-performance-with-perfview.aspx
"Late last year, Vance Morrison, who is currently an architect on the .NET Framework Performance team, released PerfView, which is a new performance tool for .NET developers. PerfView helps you discover and investigate performance hotspots in .NET Framework apps, and enables you to deliver consistently high-performance apps to your customers.
Using PerfView, you can perform complex CPU performance analyses to solve hard-to-detect performance problems. PerfView's revolutionary grouping and folding features are what makes it possible to grasp and solve these difficult problems."
Use WPRUI.exe to capture a trace and analyze the CPU usage with WPA.exe.
Microsoft explained how to analyze the created trace in the following video:
Defrag Tools: #42 - WPT - CPU Analysis
http://channel9.msdn.com/Shows/Defrag-Tools/Defrag-Tools-42-WPT-CPU-Analysis
Collect ETW with Perfview and follow the big % numbers.
Try running ~*e !clrstack in WinDbg to get the call stacks of all threads, and look for code that appears repeatedly.

How to find application suspension time from GC log files

I am new to garbage collection. Please help me find answers to the following questions, with clear explanations.
I want to find the application suspension time and suspension count from the GC log files of different JVMs:
Sun
JRockit
IBM
of different versions.
A. For Sun I am using the JVM options
-Xloggc:gc.log -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+UseParNewGC -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode
B. For JRockit I am using the JVM options
-Xms100m -Xmx100m -Xns50m -Xss200k -Xgc:genconpar -Xverbose:gc -Xverboselog:gc_jrockit.log
My questions are:
Q1. What is the suspension time of an application, and why does it occur?
Q2. How can I tell from the logs that a suspension occurred?
Q3. Is the suspension time of an application equal to the sum of the GC times?
E.g.:
2013-09-06T23:35:23.382-0700: [GC 150.505: [ParNew
Desired survivor size 50331648 bytes, new threshold 2 (max 15)
- age 1: 28731664 bytes, 28731664 total
- age 2: 28248376 bytes, 56980040 total
: 688128K->98304K(688128K), 0.2166700 secs] 697655K->163736K(10387456K), 0.2167900 secs] [Times: user=0.44 sys=0.04, real=0.22 secs]
2013-09-06T23:35:28.044-0700: 155.167: [GC 155.167: [ParNew
Desired survivor size 50331648 bytes, new threshold 15 (max 15)
- age 1: 22333512 bytes, 22333512 total
- age 2: 27468336 bytes, 49801848 total
: 688128K->71707K(688128K), 0.0737140 secs] 753560K->164731K(10387456K), 0.0738410 secs] [Times: user=0.30 sys=0.02, real=0.07 secs]
suspensionTime = 0.2167900 secs + 0.0738410 secs
i. If yes, do I need to add the times of every GC that occurs?
ii. If no, please explain in detail, for the different collectors, which log entries count as a suspension and which do not.
Q4. Can we say the GC times "0.2167900, 0.0738410" are equal to GC pauses, i.e. TotalGCPause = 0.2167900 + 0.0738410?
Q5. Can we calculate the suspension time using only the above flags, or do we need to include extra flags like -XX:+PrintGCApplicationStoppedTime for Sun?
Q6. I have seen a tool, dynaTrace, that calculates the suspension time and count for Sun without using the flag -XX:+PrintGCApplicationStoppedTime.
If you want the most precise information about the amount of time your application was stopped due to GC activity, you should go with -XX:+PrintGCApplicationStoppedTime.
-XX:+PrintGCApplicationStoppedTime enables the printing of the amount of time application threads have been stopped as the result of an internal HotSpot VM operation (GC and safe-point operations).
But for practical daily usage, the information provided by the GC logs is sufficient. You can use the approach described in your question 3 to determine the time spent in GC.
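As a rough illustration of that approach, the wall-clock pauses in log lines like the ones you posted can be summed with a small program. This is a sketch that assumes the -XX:+PrintGCDetails format shown in your question, where each event ends with a "[Times: user=... sys=... real=X secs]" tail:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPauseSum {
    // Matches the wall-clock time in the "[Times: user=... sys=... real=0.22 secs]" tail.
    private static final Pattern REAL = Pattern.compile("real=([0-9]+\\.[0-9]+) secs");

    public static void main(String[] args) throws IOException {
        double totalSecs = 0;
        int events = 0;
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = REAL.matcher(line);
            while (m.find()) {
                totalSecs += Double.parseDouble(m.group(1));
                events++;
            }
        }
        System.out.printf("events=%d, summed real time=%.3f secs%n", events, totalSecs);
    }
}

Run it as java GcPauseSum gc.log. Be aware that with CMS this sum also includes concurrent phases (CMS-concurrent-mark and friends), which do not stop the application, so it overestimates the true suspension time; that is exactly why -XX:+PrintGCApplicationStoppedTime gives the more precise numbers.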
