Frequent Java GC in Spark executor - apache-spark

The Spark executor starts with the following options:
/root/spark/jdk1.8.0_151/bin/java -cp /root/spark/spark-2.2.0-bin-hadoop2.7/conf/:/root/spark/spark-2.2.0-bin-hadoop2.7/jars/* -Xmx6144M -Dspark.driver.port=20637 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler#172.16.50.102:20637 --executor-id 29 --hostname 172.16.50.103 --cores 2 --app-id app-20180131184049-0002 --worker-url spark://Worker#172.16.50.103:39368
And I see frequent Java GC activity (from the executor's log):
2.431: [GC (Metadata GC Threshold) [PSYoungGen: 362763K->34308K(611840K)] 362763K->34396K(2010112K), 0.0780262 secs] [Times: user=1.09 sys=0.18, real=0.08 secs]
2.509: [Full GC (Metadata GC Threshold) [PSYoungGen: 34308K->0K(611840K)] [ParOldGen: 88K->32991K(772096K)] 34396K->32991K(1383936K), [Metaspace: 20866K->20866K(1067008K)], 0.0541261 secs] [Times: user=0.70 sys=0.08, real=0.05 secs]
303.670: [GC (Allocation Failure) [PSYoungGen: 524800K->87035K(834560K)] 557791K->266418K(1606656K), 0.1241616 secs] [Times: user=2.92 sys=0.51, real=0.12 secs]
315.196: [GC (Allocation Failure) [PSYoungGen: 834555K->87032K(1136640K)] 1013938K->981300K(2037248K), 0.4551608 secs] [Times: user=12.47 sys=5.12, real=0.46 secs]
315.651: [Full GC (Ergonomics) [PSYoungGen: 87032K->69466K(1136640K)] [ParOldGen: 894267K->887330K(2752000K)] 981300K->956797K(3888640K), [Metaspace: 34446K->34446K(1079296K)], 5.9107553 secs] [Times: user=227.48 sys=4.56, real=5.91 secs]
336.571: [GC (Allocation Failure) [PSYoungGen: 1119066K->87030K(1225728K)] 2006397K->1979465K(3977728K), 0.7949645 secs] [Times: user=22.85 sys=10.80, real=0.79 secs]
337.366: [Full GC (Ergonomics) [PSYoungGen: 87030K->0K(1225728K)] [ParOldGen: 1892434K->1975360K(4194304K)] 1979465K->1975360K(5420032K), [Metaspace: 34446K->34446K(1079296K)], 12.1924380 secs] [Times: user=488.02 sys=4.94, real=12.20 secs]
366.596: [GC (Allocation Failure) [PSYoungGen: 1138688K->87012K(1225728K)] 3114048K->3116557K(5420032K), 0.9059287 secs] [Times: user=31.37 sys=5.71, real=0.91 secs]
367.502: [Full GC (Ergonomics) [PSYoungGen: 87012K->0K(1225728K)] [ParOldGen: 3029544K->3096222K(4194304K)] 3116557K->3096222K(5420032K), [Metaspace: 34449K->34449K(1079296K)], 13.1129752 secs] [Times: user=518.70 sys=11.04, real=13.11 secs]
396.419: [Full GC (Ergonomics) [PSYoungGen: 1138688K->1023K(1225728K)] [ParOldGen: 3096222K->4193874K(4194304K)] 4234910K->4194898K(5420032K), [Metaspace: 34456K->34456K(1079296K)], 17.7615804 secs] [Times: user=714.95 sys=22.54, real=17.76 secs]
430.400: [Full GC (Ergonomics) [PSYoungGen: 1138688K->1101822K(1225728K)] [ParOldGen: 4193874K->4193922K(4194304K)] 5332562K->5295744K(5420032K), [Metaspace: 34462K->34462K(1079296K)], 24.3810387 secs] [Times: user=997.79 sys=15.83, real=24.38 secs]
454.851: [Full GC (Ergonomics) [PSYoungGen: 1138688K->1130794K(1225728K)] [ParOldGen: 4193922K->4193922K(4194304K)] 5332610K->5324716K(5420032K), [Metaspace: 34477K->34477K(1079296K)], 26.3723404 secs] [Times: user=1086.31 sys=11.56, real=26.37 secs]
481.226: [Full GC (Ergonomics) [PSYoungGen: 1138688K->1130798K(1225728K)] [ParOldGen: 4193922K->4193922K(4194304K)] 5332610K->5324720K(5420032K), [Metaspace: 34477K->34477K(1079296K)], 19.2936132 secs] [Times: user=779.84 sys=22.07, real=19.30 secs]
500.521: [Full GC (Ergonomics) [PSYoungGen: 1138688K->1130862K(1225728K)] [ParOldGen: 4193922K->4193922K(4194304K)] 5332610K->5324784K(5420032K), [Metaspace: 34477K->34477K(1079296K)], 22.6870152 secs] [Times: user=926.71 sys=18.37, real=22.69 secs]
The frequent GC makes the executor hang for a long time and unable to report status to the driver, so the driver kills the executor.
The Spark program:
Iterator iter = this.dbtable.entrySet().iterator();
while (iter.hasNext()) {
    Map.Entry me = (Map.Entry) iter.next();
    String dt = "(" + me.getValue() + ")" + me.getKey();
    logger.info("[\033[32m" + dt + "\033[0m]");
    Dataset<Row> jdbcDF = ss.read().format("jdbc")
            .option("driver", "com.mysql.jdbc.Driver")
            .option("url", this.url)
            .option("dbtable", dt)
            .option("user", this.user)
            .option("password", this.password)
            .option("useSSL", false)
            .load();
    jdbcDF.createOrReplaceTempView((String) me.getKey());
}
Dataset<Row> result = ss.sql(this.sql);
result.write().format("jdbc")
        .option("driver", "com.mysql.jdbc.Driver")
        .option("url", this.dst_url)
        .option("dbtable", this.dst_table)
        .option("user", this.user)
        .option("password", this.password)
        .option("useSSL", false)
        .option("rewriteBatchedStatements", true)
        .option("sessionVariables", "sql_log_bin=off")
        .save();
The stacktrace:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3181)
at java.util.ArrayList.grow(ArrayList.java:265)
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:239)
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:231)
at java.util.ArrayList.add(ArrayList.java:462)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3414)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3112)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2341)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2736)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2484)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1966)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:301)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
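One thing the trace points at: the JDBC read above has no partitioning options, so each table is fetched through a single task and the MySQL driver buffers the whole result set in one executor's heap (the OOM is inside MysqlIO.readSingleRowSet). Below is a hedged sketch of a partitioned read, written in Scala; the column name "id", the bounds, and the partition count are placeholders, not values from the question:
val jdbcDF = ss.read.format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("url", url)
  .option("dbtable", dt)
  .option("user", user)
  .option("password", password)
  // spread the read across several tasks instead of one huge one
  .option("partitionColumn", "id")   // a numeric column is assumed to exist
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  // hint to fetch rows in batches (the MySQL driver may also need useCursorFetch=true to honor it)
  .option("fetchsize", "1000")
  .load()
The same options exist on the Java API used in the question.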

Related

Filter spark dataframe based on another dataframe columns by converting it into list

df = spark.createDataFrame([("1gh","25g","36h"),("2gf","3ku","4we"),("12w","53v","c74"),("1a2","3d4","4c5"),("232","3df","4rt")], ["a","b","c"])
filter_df = spark.createDataFrame([("2gf","3ku"),("12w","53v"), ("12w","53v")], ["a","b"])
I took "a" OF "filter_df" and created an rdd to then to list from the following code
unique_list = filter_df.select("a").rdd.flatMap(lambda x: x).distinct().collect()
This gives me:
unique_list = [u'2gf', u'12w']
I tried converting the RDD into a list using the collect operation, but this gives me the allocation failures shown below:
final_df = df.filter(F.col("a").isin(unique_list))
118.255: [GC (Allocation Failure) [PSYoungGen: 1380832K->538097K(1772544K)] 2085158K->1573272K(3994112K), 0.0622847 secs] [Times: user=2.31 sys=1.76, real=0.06 secs]
122.540: [GC (Allocation Failure) [PSYoungGen: 1772529K->542497K(2028544K)] 2807704K->1581484K(4250112K), 0.3217980 secs] [Times: user=11.16 sys=13.15, real=0.33 secs]
127.071: [GC (Allocation Failure) [PSYoungGen: 1776929K->542721K(2411008K)] 2815916K->1582011K(4632576K), 0.8024852 secs] [Times: user=58.43 sys=4.85, real=0.80 secs]
133.284: [GC (Allocation Failure) [PSYoungGen: 2106881K->400752K(2446848K)] 3146171K->1583953K(4668416K), 0.4198589 secs] [Times: user=18.31 sys=12.58, real=0.42 secs]
139.050: [GC (Allocation Failure) [PSYoungGen: 1964912K->10304K(2993152K)] 3148113K->1584408K(5214720K), 0.0712454 secs] [Times: user=2.92 sys=0.88, real=0.08 secs]
146.638: [GC (Allocation Failure) [PSYoungGen: 2188864K->12768K(3036160K)] 3762968K->1588544K(5257728K), 0.1212116 secs] [Times: user=3.05 sys=3.74, real=0.12 secs]
154.153: [GC (Allocation Failure) [PSYoungGen: 2191328K->12128K(3691008K)] 3767104K->1590112K(5912576K), 0.1179030 secs] [Times: user=6.94 sys=0.11, real=0.12 secs
Required output:
final_df
+---+---+---+
| a| b| c|
+---+---+---+
|2gf|3ku|4we|
|12w|53v|c74|
+---+---+---+
What is an effective way to filter a Spark dataframe using another RDD, a list, or a different dataframe? The data above is just a sample; in reality I have a much bigger dataset.
Use a left_semi join:
df.join(filter_df, ['a','b'],'left_semi')
You can use an inner join (the comparisons need parentheses, since & binds more tightly than == in Python):
df.join(filter_df, (df.a == filter_df.a) & (df.b == filter_df.b))

I am suffering a Java G1 issue

Has anyone encountered this kind of issue with the Java G1 GC?
In the first highlighted collection the user time is about 4.5 s,
but in the second one the user time is 0 s and the system time is about 4 s.
With G1 the system time shouldn't be this high; is it a bug in the G1 GC?
Below are my GC arguments:
-Xms200g -Xmx200g -Xmn30g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSCompactAtFullCollection -XX:CMSMaxAbortablePrecleanTime=5000 -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark -verbose:gc -XX:+PrintPromotionFailure -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
2018-01-07T04:54:39.995+0800: 906650.864: [GC (Allocation Failure) 2018-01-07T04:54:39.996+0800: 906650.865: [ParNew
Desired survivor size 1610612736 bytes, new threshold 6 (max 6)
- age 1: 69747632 bytes, 69747632 total
- age 2: 9641544 bytes, 79389176 total
- age 3: 10522192 bytes, 89911368 total
- age 4: 11732392 bytes, 101643760 total
- age 5: 9158960 bytes, 110802720 total
- age 6: 10917528 bytes, 121720248 total
: 25341731K->170431K(28311552K), 0.2088528 secs] 153045380K->127882325K(206569472K), 0.2094236 secs] [Times: **user=4.53 sys=0.00, real=0.21 secs]**
Heap after GC invocations=32432 (full 10):
par new generation total 28311552K, used 170431K [0x00007f6058000000, 0x00007f67d8000000, 0x00007f67d8000000)
eden space 25165824K, 0% used [0x00007f6058000000, 0x00007f6058000000, 0x00007f6658000000)
from space 3145728K, 5% used [0x00007f6658000000, 0x00007f666266ffe0, 0x00007f6718000000)
to space 3145728K, 0% used [0x00007f6718000000, 0x00007f6718000000, 0x00007f67d8000000)
concurrent mark-sweep generation total 178257920K, used 127711893K [0x00007f67d8000000, 0x00007f9258000000, 0x00007f9258000000)
Metaspace used 54995K, capacity 55688K, committed 56028K, reserved 57344K
}
2018-01-07T04:54:40.205+0800: 906651.074: Total time for which application threads were stopped: 0.2269738 seconds, Stopping threads took: 0.0001692 seconds
{Heap before GC invocations=32432 (full 10):
par new generation total 28311552K, used 25336255K [0x00007f6058000000, 0x00007f67d8000000, 0x00007f67d8000000)
eden space 25165824K, 100% used [0x00007f6058000000, 0x00007f6658000000, 0x00007f6658000000)
from space 3145728K, 5% used [0x00007f6658000000, 0x00007f666266ffe0, 0x00007f6718000000)
to space 3145728K, 0% used [0x00007f6718000000, 0x00007f6718000000, 0x00007f67d8000000)
concurrent mark-sweep generation total 178257920K, used 127711893K [0x00007f67d8000000, 0x00007f9258000000, 0x00007f9258000000)
Metaspace used 54995K, capacity 55688K, committed 56028K, reserved 57344K
2018-01-07T04:55:02.541+0800: 906673.411: [GC (Allocation Failure) 2018-01-07T04:55:02.542+0800: 906673.411: [ParNew
Desired survivor size 1610612736 bytes, new threshold 6 (max 6)
- age 1: 93841912 bytes, 93841912 total
- age 2: 11310104 bytes, 105152016 total
- age 3: 8967160 bytes, 114119176 total
- age 4: 10278920 bytes, 124398096 total
- age 5: 11626160 bytes, 136024256 total
- age 6: 9077432 bytes, 145101688 total
: 25336255K->195827K(28311552K), 0.1926783 secs] 153048149K->127918291K(206569472K), 0.1932366 secs] [Times: **user=0.00 sys=4.07, real=0.20 secs]**
Heap after GC invocations=32433 (full 10):
par new generation total 28311552K, used 195827K [0x00007f6058000000, 0x00007f67d8000000, 0x00007f67d8000000)
eden space 25165824K, 0% used [0x00007f6058000000, 0x00007f6058000000, 0x00007f6658000000)
from space 3145728K, 6% used [0x00007f6718000000, 0x00007f6723f3cf38, 0x00007f67d8000000)
to space 3145728K, 0% used [0x00007f6658000000, 0x00007f6658000000, 0x00007f6718000000)
concurrent mark-sweep generation total 178257920K, used 127722463K [0x00007f67d8000000, 0x00007f9258000000, 0x00007f9258000000)
Metaspace used 54995K, capacity 55688K, committed 56028K, reserved 57344K
}
2018-01-07T04:55:02.735+0800: 906673.604: Total time for which application threads were stopped: 0.2149603 seconds, Stopping threads took: 0.0002262 seconds
2018-01-07T04:55:14.673+0800: 906685.542: Total time for which application threads were stopped: 0.0183883 seconds, Stopping threads took: 0.0002046 seconds
2018-01-07T04:55:14.797+0800: 906685.666: Total time for which application threads were stopped: 0.0135349 seconds, Stopping threads took: 0.0002472 seconds
2018-01-07T04:55:14.810+0800: 906685.679: Total time for which application threads were stopped: 0.0129019 seconds, Stopping threads took: 0.0001014 seconds
2018-01-07T04:55:14.823+0800: 906685.692: Total time for which application threads were stopped: 0.0125939 seconds, Stopping threads took: 0.0002915 seconds
2018-01-07T04:55:21.597+0800: 906692.466: Total time for which application threads were stopped: 0.0137018 seconds, Stopping threads took: 0.0001683 seconds
{Heap before GC invocations=32433 (full 10):
Your command line specifies -XX:+UseConcMarkSweepGC (plus ParNew), so this is the CMS collector, not G1 - this isn't a G1 issue.
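If the intent really was to run G1, the ParNew/CMS flags would have to be replaced; a hedged example keeping only the heap size and logging flags from the line above (the pause-time target is an assumption, and -Xmn is usually dropped with G1):
-Xms200g -Xmx200g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps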

Spark: graphx api OOM errors after unpersist useless RDDs

I have hit an Out Of Memory error for unknown reasons. I release the useless RDDs immediately, but after several rounds of the loop the OOM error still appears. My code is as follows:
// single source shortest path
def sssp[VD](graph: Graph[VD, Double], source: VertexId): Graph[Double, Double] = {
  graph.mapVertices((id, _) => if (id == source) 0.0 else Double.PositiveInfinity)
    .pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => scala.math.min(dist, newDist),
      triplet => {
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        } else {
          Iterator.empty
        }
      },
      (a, b) => math.min(a, b)
    )
}
def selectCandidate(candidates: RDD[(VertexId, (Double, Double))]): VertexId = {
  Random.setSeed(System.nanoTime())
  val selectLow = Random.nextBoolean()
  val (vid, (_, _)) = if (selectLow) {
    println("Select lowest bound")
    candidates.reduce((x, y) => if (x._2._1 < y._2._1) x else y)
  } else {
    println("Select highest bound")
    candidates.reduce((x, y) => if (x._2._2 > y._2._2) x else y)
  }
  vid
}
val g = {/* load graph from hdfs*/}.partitionBy(EdgePartition2D,eParts).cache
println("Vertices Size: " + g.vertices.count )
println("Edges Size: " + g.edges.count )
val resultDiameter = {
val diff = 0d
val maxIterations = 100
val filterJoin = 1e5
val vParts = 100
var deltaHigh = Double.PositiveInfinity
var deltaLow = Double.NegativeInfinity
var candidates = g.vertices.map(x => (x._1, (Double.NegativeInfinity,
Double.PositiveInfinity)))
.partitionBy(new HashPartitioner(vParts))
.persist(StorageLevel.MEMORY_AND_DISK) // (vid, low, high)
var round = 0
var candidateCount = candidates.count
while (deltaHigh - deltaLow > diff && candidateCount > 0 && round <= maxIterations) {
val currentVertex = dia.selectCandidate(candidates)
val dist: RDD[(VertexId, Double)] = dia.sssp(g, currentVertex)
.vertices
.partitionBy(new HashPartitioner(vParts)) // join more efficiently
.persist(StorageLevel.MEMORY_AND_DISK)
val eccentricity = dist.map({ case (vid, length) => length }).max
println("Eccentricity = %.1f".format(eccentricity))
val subDist = if(candidateCount > filterJoin) {
println("Directly use Dist")
dist
} else { // when candidates is small than filterJoin, filter the useless vertices
println("Filter Dist")
val candidatesMap = candidates.sparkContext.broadcast(
candidates.collect.toMap)
val subDist = dist.filter({case (vid, length) =>
candidatesMap.value.contains(vid)})
.persist(StorageLevel.MEMORY_AND_DISK)
println("Sub Dist Count: " + subDist.count)
subDist
}
var previousCandidates = candidates
candidates = candidates.join(subDist).map({ case (vid, ((low, high), d)) =>
(vid,
(Array(low, eccentricity - d, d).max,
Array(high, eccentricity + d).min))
}).persist(StorageLevel.MEMORY_AND_DISK)
candidateCount = candidates.count
println("Candidates Count 1 : " + candidateCount)
previousCandidates.unpersist(true) // release useless rdd
dist.unpersist(true) // release useless rdd
deltaLow = Array(deltaLow,
candidates.map({ case (_, (low, _)) => low }).max).max
deltaHigh = Array(deltaHigh, 2 * eccentricity,
candidates.map({ case (_, (_, high)) => high }).max).min
previousCandidates = candidates
candidates = candidates.filter({ case (_, (low, high)) =>
!((high <= deltaLow && low >= deltaHigh / 2d) || low == high)
})
.partitionBy(new HashPartitioner(vParts)) // join more efficiently
.persist(StorageLevel.MEMORY_AND_DISK)
candidateCount = candidates.count
println("Candidates Count 2:" + candidateCount)
previousCandidates.unpersist(true) // release useless rdd
round += 1
println(s"Round=${round},Low=${deltaLow}, High=${deltaHigh}, Candidates=${candidateCount}")
}
deltaLow
}
println(s"Diameter $resultDiameter")
println("Complete!")
The main data in the while block are a graph object g and an RDD candidates. g is used to compute a single-source shortest path in each round and its structure does not change; the size of candidates decreases round by round.
In each round I manually unpersist the useless RDDs in blocking mode, so I think there should be enough memory for the following operations. However, it stops with an OOM in round 6 or 7, seemingly at random. By the time the program reaches round 6 or 7, candidates has shrunk dramatically, to about 10% or less of the original size. A sample of the output follows; the candidates size decreases from 15,288,624 in round 1 to 67,451 in round 7:
Vertices Size: 15,288,624
Edges Size: 228,097,574
Select lowest bound
Eccentricity = 12.0
Directly use Dist
Candidates Count 1 : 15288624
Candidates Count 2:15288623
Round=1,Low=12.0, High=24.0, Candidates=15288623
Select lowest bound
Eccentricity = 13.0
Directly use Dist
Candidates Count 1 : 15288623
Candidates Count 2:15288622
Round=2,Low=13.0, High=24.0, Candidates=15288622
Select highest bound
Eccentricity = 18.0
Directly use Dist
Candidates Count 1 : 15288622
Candidates Count 2:6578370
Round=3,Low=18.0, High=23.0, Candidates=6578370
Select lowest bound
Eccentricity = 12.0
Directly use Dist
Candidates Count 1 : 6578370
Candidates Count 2:6504563
Round=4,Low=18.0, High=23.0, Candidates=6504563
Select lowest bound
Eccentricity = 11.0
Directly use Dist
Candidates Count 1 : 6504563
Candidates Count 2:412789
Round=5,Low=18.0, High=22.0, Candidates=412789
Select highest bound
Eccentricity = 17.0
Directly use Dist
Candidates Count 1 : 412789
Candidates Count 2:288670
Round=6,Low=18.0, High=22.0, Candidates=288670
Select highest bound
Eccentricity = 18.0
Directly use Dist
Candidates Count 1 : 288670
Candidates Count 2:67451
Round=7,Low=18.0, High=22.0, Candidates=67451
The tail end of the spark.info log:
16/12/12 14:03:09 WARN YarnAllocator: Expected to find pending requests, but found none.
16/12/12 14:06:21 INFO YarnAllocator: Canceling requests for 0 executor containers
16/12/12 14:06:33 WARN YarnAllocator: Expected to find pending requests, but found none.
16/12/12 14:14:26 WARN NioEventLoop: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
16/12/12 14:18:14 WARN NioEventLoop: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
at io.netty.util.internal.MpscLinkedQueue.offer(MpscLinkedQueue.java:123)
at io.netty.util.internal.MpscLinkedQueue.add(MpscLinkedQueue.java:218)
at io.netty.util.concurrent.SingleThreadEventExecutor.fetchFromScheduledTaskQueue(SingleThreadEventExecutor.java:260)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:347)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:374)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
at java.lang.Thread.run(Thread.java:744)
16/12/12 14:18:14 WARN DFSClient: DFSOutputStream ResponseProcessor exception for block BP-552217672-100.76.16.204-1470826698239:blk_1377987137_304302272
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1492)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:116)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:721)
16/12/12 14:14:39 WARN AbstractConnector:
java.lang.OutOfMemoryError: Java heap space
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:233)
at org.spark-project.jetty.server.nio.SelectChannelConnector.accept(SelectChannelConnector.java:109)
at org.spark-project.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:938)
at org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:744)
16/12/12 14:20:06 INFO ApplicationMaster: Final app status: FAILED, exitCode: 12, (reason: Exception was thrown 1 time(s) from Reporter thread.)
16/12/12 14:19:38 WARN DFSClient: Error Recovery for block BP-552217672-100.76.16.204-1470826698239:blk_1377987137_304302272 in pipeline 100.76.15.28:9003, 100.76.48.218:9003, 100.76.48.199:9003: bad datanode 100.76.15.28:9003
16/12/12 14:18:58 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
16/12/12 14:20:49 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-198] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
16/12/12 14:20:49 INFO SparkContext: Invoking stop() from shutdown hook
16/12/12 14:20:49 INFO ContextCleaner: Cleaned shuffle 446
16/12/12 14:20:49 WARN AkkaRpcEndpointRef: Error sending message [message = RemoveRdd(2567)] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Recipient[Actor[akka://sparkDriver/user/BlockManagerMaster#-213595070]] had already been terminated.. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Failure.recover(Try.scala:185)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.scala$concurrent$impl$Promise$DefaultPromise$$dispatchOrAddCallback(Promise.scala:280)
at scala.concurrent.impl.Promise$DefaultPromise.onComplete(Promise.scala:270)
at scala.concurrent.Future$class.recover(Future.scala:324)
at scala.concurrent.impl.Promise$DefaultPromise.recover(Promise.scala:153)
at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.ask(AkkaRpcEnv.scala:376)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:100)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
at org.apache.spark.storage.BlockManagerMaster.removeRdd(BlockManagerMaster.scala:104)
at org.apache.spark.SparkContext.unpersistRDD(SparkContext.scala:1630)
at org.apache.spark.ContextCleaner.doCleanupRDD(ContextCleaner.scala:208)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:185)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
Caused by: akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/BlockManagerMaster#-213595070]] had already been terminated.
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:132)
at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.ask(AkkaRpcEnv.scala:364)
... 12 more
16/12/12 14:20:49 WARN QueuedThreadPool: 5 threads could not be stopped
16/12/12 14:20:49 INFO SparkUI: Stopped Spark web UI at http://10.215.154.152:56338
16/12/12 14:20:49 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/12/12 14:20:49 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/12/12 14:21:04 WARN AkkaRpcEndpointRef: Error sending message [message = RemoveRdd(2567)] in 2 attempts
org.apache.spark.rpc.RpcTimeoutException: Recipient[Actor[akka://sparkDriver/user/BlockManagerMaster#-213595070]] had already been terminated.. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
The tail end of the gc.log:
2016-12-12T14:10:43.541+0800: 16832.953: [Full GC 2971008K->2971007K(2971008K), 11.4284920 secs]
2016-12-12T14:10:54.990+0800: 16844.403: [Full GC 2971007K->2971007K(2971008K), 11.4479110 secs]
2016-12-12T14:11:06.457+0800: 16855.870: [GC 2971007K(2971008K), 0.6827710 secs]
2016-12-12T14:11:08.825+0800: 16858.237: [Full GC 2971007K->2971007K(2971008K), 11.5480350 secs]
2016-12-12T14:11:20.384+0800: 16869.796: [Full GC 2971007K->2971007K(2971008K), 11.0481490 secs]
2016-12-12T14:11:31.442+0800: 16880.855: [Full GC 2971007K->2971007K(2971008K), 11.0184790 secs]
2016-12-12T14:11:42.472+0800: 16891.884: [Full GC 2971008K->2971008K(2971008K), 11.3124900 secs]
2016-12-12T14:11:53.795+0800: 16903.207: [Full GC 2971008K->2971008K(2971008K), 10.9517160 secs]
2016-12-12T14:12:04.760+0800: 16914.172: [Full GC 2971008K->2971007K(2971008K), 11.0969500 secs]
2016-12-12T14:12:15.868+0800: 16925.281: [Full GC 2971008K->2971008K(2971008K), 11.1244090 secs]
2016-12-12T14:12:27.003+0800: 16936.416: [Full GC 2971008K->2971008K(2971008K), 11.0206800 secs]
2016-12-12T14:12:38.035+0800: 16947.448: [Full GC 2971008K->2971008K(2971008K), 11.0024270 secs]
2016-12-12T14:12:49.048+0800: 16958.461: [Full GC 2971008K->2971008K(2971008K), 10.9831440 secs]
2016-12-12T14:13:00.042+0800: 16969.454: [GC 2971008K(2971008K), 0.7338780 secs]
2016-12-12T14:13:02.496+0800: 16971.908: [Full GC 2971008K->2971007K(2971008K), 11.1536860 secs]
2016-12-12T14:13:13.661+0800: 16983.074: [Full GC 2971007K->2971007K(2971008K), 10.9956150 secs]
2016-12-12T14:13:24.667+0800: 16994.080: [Full GC 2971007K->2971007K(2971008K), 11.0139660 secs]
2016-12-12T14:13:35.691+0800: 17005.104: [GC 2971007K(2971008K), 0.6693770 secs]
2016-12-12T14:13:38.115+0800: 17007.527: [Full GC 2971007K->2971006K(2971008K), 11.0514040 secs]
2016-12-12T14:13:49.178+0800: 17018.590: [Full GC 2971007K->2971007K(2971008K), 10.8881160 secs]
2016-12-12T14:14:00.076+0800: 17029.489: [GC 2971007K(2971008K), 0.7046370 secs]
2016-12-12T14:14:02.498+0800: 17031.910: [Full GC 2971007K->2971007K(2971008K), 11.3424300 secs]
2016-12-12T14:14:13.862+0800: 17043.274: [Full GC 2971008K->2971006K(2971008K), 11.6215890 secs]
2016-12-12T14:14:25.503+0800: 17054.915: [GC 2971006K(2971008K), 0.7196840 secs]
2016-12-12T14:14:27.857+0800: 17057.270: [Full GC 2971008K->2971007K(2971008K), 11.3879990 secs]
2016-12-12T14:14:39.266+0800: 17068.678: [Full GC 2971007K->2971007K(2971008K), 11.1611420 secs]
2016-12-12T14:14:50.446+0800: 17079.859: [GC 2971007K(2971008K), 0.6976180 secs]
2016-12-12T14:14:52.782+0800: 17082.195: [Full GC 2971007K->2971007K(2971008K), 11.4318900 secs]
2016-12-12T14:15:04.235+0800: 17093.648: [Full GC 2971007K->2971007K(2971008K), 11.3429010 secs]
2016-12-12T14:15:15.598+0800: 17105.010: [GC 2971007K(2971008K), 0.6832320 secs]
2016-12-12T14:15:17.930+0800: 17107.343: [Full GC 2971008K->2971007K(2971008K), 11.1898520 secs]
2016-12-12T14:15:29.131+0800: 17118.544: [Full GC 2971007K->2971007K(2971008K), 10.9680150 secs]
2016-12-12T14:15:40.110+0800: 17129.522: [GC 2971007K(2971008K), 0.7444890 secs]
2016-12-12T14:15:42.508+0800: 17131.920: [Full GC 2971007K->2971007K(2971008K), 11.3052160 secs]
2016-12-12T14:15:53.824+0800: 17143.237: [Full GC 2971007K->2971007K(2971008K), 10.9484100 secs]
2016-12-12T14:16:04.783+0800: 17154.196: [Full GC 2971007K->2971007K(2971008K), 10.9543950 secs]
2016-12-12T14:16:15.748+0800: 17165.160: [GC 2971007K(2971008K), 0.7066150 secs]
2016-12-12T14:16:18.176+0800: 17167.588: [Full GC 2971007K->2971007K(2971008K), 11.1201370 secs]
2016-12-12T14:16:29.307+0800: 17178.719: [Full GC 2971007K->2971007K(2971008K), 11.0746950 secs]
2016-12-12T14:16:40.392+0800: 17189.805: [Full GC 2971007K->2971007K(2971008K), 11.0036170 secs]
2016-12-12T14:16:51.407+0800: 17200.819: [Full GC 2971007K->2971007K(2971008K), 10.9655670 secs]
2016-12-12T14:17:02.383+0800: 17211.796: [Full GC 2971007K->2971007K(2971008K), 10.7348560 secs]
2016-12-12T14:17:13.128+0800: 17222.540: [GC 2971007K(2971008K), 0.6679470 secs]
2016-12-12T14:17:15.450+0800: 17224.862: [Full GC 2971007K->2971007K(2971008K), 10.6219270 secs]
2016-12-12T14:17:26.081+0800: 17235.494: [Full GC 2971007K->2971007K(2971008K), 10.9158450 secs]
2016-12-12T14:17:37.016+0800: 17246.428: [Full GC 2971007K->2971007K(2971008K), 11.3107490 secs]
2016-12-12T14:17:48.337+0800: 17257.750: [Full GC 2971007K->2971007K(2971008K), 11.0769460 secs]
2016-12-12T14:17:59.424+0800: 17268.836: [GC 2971007K(2971008K), 0.6707600 secs]
2016-12-12T14:18:01.850+0800: 17271.262: [Full GC 2971007K->2970782K(2971008K), 12.6348300 secs]
2016-12-12T14:18:14.496+0800: 17283.909: [GC 2970941K(2971008K), 0.7525790 secs]
2016-12-12T14:18:16.890+0800: 17286.303: [Full GC 2971006K->2970786K(2971008K), 13.1047470 secs]
2016-12-12T14:18:30.008+0800: 17299.421: [GC 2970836K(2971008K), 0.8139710 secs]
2016-12-12T14:18:32.458+0800: 17301.870: [Full GC 2971005K->2970873K(2971008K), 13.0410540 secs]
2016-12-12T14:18:45.512+0800: 17314.925: [Full GC 2971007K->2970893K(2971008K), 12.7169690 secs]
2016-12-12T14:18:58.239+0800: 17327.652: [GC 2970910K(2971008K), 0.7314350 secs]
2016-12-12T14:19:00.557+0800: 17329.969: [Full GC 2971008K->2970883K(2971008K), 11.1889000 secs]
2016-12-12T14:19:11.767+0800: 17341.180: [Full GC 2971006K->2970940K(2971008K), 11.4069700 secs]
2016-12-12T14:19:23.185+0800: 17352.597: [GC 2970950K(2971008K), 0.6689360 secs]
2016-12-12T14:19:25.484+0800: 17354.896: [Full GC 2971007K->2970913K(2971008K), 12.6980050 secs]
2016-12-12T14:19:38.194+0800: 17367.607: [Full GC 2971004K->2970902K(2971008K), 12.7641130 secs]
2016-12-12T14:19:50.968+0800: 17380.380: [GC 2970921K(2971008K), 0.6966130 secs]
2016-12-12T14:19:53.266+0800: 17382.678: [Full GC 2971007K->2970875K(2971008K), 12.9416660 secs]
2016-12-12T14:20:06.233+0800: 17395.645: [Full GC 2971007K->2970867K(2971008K), 13.2740780 secs]
2016-12-12T14:20:19.527+0800: 17408.939: [GC 2970881K(2971008K), 0.7696770 secs]
2016-12-12T14:20:22.024+0800: 17411.436: [Full GC 2971007K->2970886K(2971008K), 13.8729770 secs]
2016-12-12T14:20:35.919+0800: 17425.331: [Full GC 2971002K->2915146K(2971008K), 12.8270160 secs]
2016-12-12T14:20:48.762+0800: 17438.175: [GC 2915155K(2971008K), 0.6856650 secs]
2016-12-12T14:20:51.271+0800: 17440.684: [Full GC 2971007K->2915307K(2971008K), 12.4895750 secs]
2016-12-12T14:21:03.771+0800: 17453.184: [GC 2915320K(2971008K), 0.6249910 secs]
2016-12-12T14:21:06.377+0800: 17455.789: [Full GC 2971007K->2914274K(2971008K), 12.6835220 secs]
2016-12-12T14:21:19.129+0800: 17468.541: [GC 2917963K(2971008K), 0.6917090 secs]
2016-12-12T14:21:21.526+0800: 17470.938: [Full GC 2971007K->2913949K(2971008K), 13.0442320 secs]
2016-12-12T14:21:36.588+0800: 17486.000: [GC 2936827K(2971008K), 0.7244690 secs]
So the logs suggest there might be a memory leak, and it could occur in one of two places:
1) my code, or 2) the Spark GraphX API code.
Can anyone help me find the cause if it is in my code?
I don't think the unpersist() API is causing the out-of-memory error. The OutOfMemory is caused by the collect() call, because collect() (an action, unlike a transformation) fetches the entire RDD onto the single driver machine.
A few suggestions (a sketch of the first two follows below):
Increasing the driver memory is one partial solution, which you have already implemented. If you are working with JDK 8, use the G1GC collector to manage large heaps.
You can play with storage levels (MEMORY_AND_DISK, OFF_HEAP, etc.) to fine-tune them for your application.
Have a look at this official documentation guide for more details.
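A hedged illustration of the first two suggestions at submit time; the memory size is a placeholder, not a value taken from the question:
spark-submit \
  --driver-memory 16g \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  <application-jar> [application-arguments]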
I haven't solved the problem completely, but I have partly fixed it:
Increase the driver memory. I mentioned above that it stopped in round 6 or 7, but when I doubled the driver memory it stopped at round 14 instead. So I think a driver-memory OOM might be one reason.
Save the candidates RDD to HDFS and continue the process next time, so the earlier computation is not wasted.
Serialize the candidates RDD with Kryo. It costs some computation to encode and decode, but saves a great amount of memory.
These are not the perfect solution, but they do work in my case (a sketch of the last two items follows below). I hope someone else can give the perfect one.
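A rough sketch of those last two fixes, assuming the same (VertexId, (Double, Double)) element type as the code above; the HDFS path is a placeholder:
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// switch the serializer to Kryo before the SparkContext is created
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[(Long, (Double, Double))]))

// keep cached blocks in serialized form, so the compact bytes are what occupy the heap
candidates.persist(StorageLevel.MEMORY_AND_DISK_SER)

// at the end of a round, checkpoint the state to HDFS so a later run can resume from it
candidates.saveAsObjectFile(s"hdfs:///tmp/diameter/candidates_round_$round")
// and on restart:
// val candidates = sc.objectFile[(VertexId, (Double, Double))]("hdfs:///tmp/diameter/candidates_round_7")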

Thread dump, CPU utilization analysis

We have an issue in our application.
After the application runs for some days, at some point the CPU utilization reaches 100%, which leads to slow responses from the application.
Following the links I went through, I looked at per-thread CPU usage with top (Shift+H) and also took thread dumps.
I converted the hot thread's id into hex and searched for it in the thread dump.
I found the following details in the thread dump:
"Concurrent Mark-Sweep GC Thread" prio=10 tid=0x0000000045cce000 nid=0x10d1 runnable
"Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x0000000045cc6000 nid=0x10cd runnable
"Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x0000000045cc8000 nid=0x10ce runnable
"Gang worker#2 (Parallel CMS Threads)" prio=10 tid=0x0000000045cc9800 nid=0x10cf runnable
"Gang worker#3 (Parallel CMS Threads)" prio=10 tid=0x0000000045ccb800 nid=0x10d0 runnable
"VM Periodic Task Thread" prio=10 tid=0x00002b8c48012000 nid=0x10da waiting on condition
JNI global references: 1231
Heap
par new generation total 436928K, used 93221K [0x00000006b0000000, 0x00000006d0000000, 0x00000006d0000000)
eden space 349568K, 21% used [0x00000006b0000000, 0x00000006b484f438, 0x00000006c5560000)
from space 87360K, 21% used [0x00000006c5560000, 0x00000006c681a020, 0x00000006caab0000)
to space 87360K, 0% used [0x00000006caab0000, 0x00000006caab0000, 0x00000006d0000000)
concurrent mark-sweep generation total 4718592K, used 3590048K [0x00000006d0000000, 0x00000007f0000000, 0x00000007f0000000)
concurrent-mark-sweep perm gen total 262144K, used 217453K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000)
CMS: abort preclean due to time 2015-09-24T14:16:14.752+0200: 4505865.908: [CMS-concurrent-abortable-preclean: 4.332/5.134 secs] [Times: user=5.22 sys=0.08, real=5.14 secs]
2015-09-24T14:16:14.756+0200: 4505865.912: [GC[YG occupancy: 127725 K (436928 K)]4505865.912: [Rescan (parallel) , 0.0602290 secs]4505865.973: [weak refs processing, 0.0000220 secs] [1 CMS-remark: 3590048K(4718592K)] 3717774K(5155520K), 0.0604150 secs] [Times: user=0.64 sys=0.00, real=0.06 secs]
2015-09-24T14:16:14.817+0200: 4505865.973: [CMS-concurrent-sweep-start]
2015-09-24T14:16:18.048+0200: 4505869.204: [CMS-concurrent-sweep: 3.227/3.231 secs] [Times: user=3.37 sys=0.03, real=3.23 secs]
2015-09-24T14:16:18.048+0200: 4505869.204: [CMS-concurrent-reset-start]
2015-09-24T14:16:18.058+0200: 4505869.214: [CMS-concurrent-reset: 0.010/0.010 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2015-09-24T14:16:18.312+0200: 4505869.468: [GC [1 CMS-initial-mark: 3590044K(4718592K)] 3788126K(5155520K), 0.2487070 secs] [Times: user=0.25 sys=0.00, real=0.25 secs]
2015-09-24T14:16:18.561+0200: 4505869.717: [CMS-concurrent-mark-start]
2015-09-24T14:16:23.202+0200: 4505874.358: [CMS-concurrent-mark: 4.626/4.641 secs] [Times: user=17.89 sys=0.39, real=4.64 secs]
2015-09-24T14:16:23.202+0200: 4505874.358: [CMS-concurrent-preclean-start]
2015-09-24T14:16:24.094+0200: 4505875.250: [CMS-concurrent-preclean: 0.891/0.891 secs] [Times: user=0.95 sys=0.01, real=0.90 secs]
2015-09-24T14:16:24.094+0200: 4505875.250: [CMS-concurrent-abortable-preclean-start]
2015-09-24T14:16:25.347+0200: 4505876.503: [GC 4505876.503: [ParNew: 368744K->19384K(436928K), 0.0492700 secs] 3958788K->3609428K(5155520K), 0.0494530 secs] [Times: user=0.52 sys=0.00, real=0.05 secs]
CMS: abort preclean due to time 2015-09-24T14:16:29.105+0200: 4505880.261: [CMS-concurrent-abortable-preclean: 3.972/5.012 secs] [Times: user=4.87 sys=0.08, real=5.01 secs]
2015-09-24T14:16:29.109+0200: 4505880.265: [GC[YG occupancy: 123643 K (436928 K)]4505880.266: [Rescan (parallel) , 0.0643880 secs]4505880.330: [weak refs processing, 0.0000180 secs] [1 CMS-remark: 3590044K(4718592K)] 3713687K(5155520K), 0.0645660 secs] [Times: user=0.68 sys=0.00, real=0.06 secs]
2015-09-24T14:16:29.175+0200: 4505880.331: [CMS-concurrent-sweep-start]
2015-09-24T14:16:32.406+0200: 4505883.562: [CMS-concurrent-sweep: 3.227/3.231 secs] [Times: user=3.35 sys=0.03, real=3.23 secs]
2015-09-24T14:16:32.406+0200: 4505883.562: [CMS-concurrent-reset-start]
2015-09-24T14:16:32.416+0200: 4505883.572: [CMS-concurrent-reset: 0.010/0.010 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2015-09-24T14:16:34.047+0200: 4505885.203: [GC [1 CMS-initial-mark: 3590040K(4718592K)] 3814265K(5155520K), 0.2704050 secs] [Times: user=0.27 sys=0.00, real=0.27 secs]
2015-09-16T23:18:46.554+0200: 3847217.710: [CMS-concurrent-mark-start]
2015-09-16T23:18:46.926+0200: 3847218.083: [Full GC 3847218.083: [CMS2015-09-16T23:18:50.249+0200: 3847221.405: [CMS-concurrent-mark: 3.688/3.695 secs] [Times: user=13.96 sys=0.31, real=3.70 secs]
(concurrent mode failure): 3073996K->3011216K(4718592K), 20.7183280 secs] 3348996K->3011216K(5155520K), [CMS Perm : 262143K->40538K(262144K)], 20.7185010 secs] [Times: user=29.87 sys=0.31, real=20.71 secs]
I am using Java 1.6. Is CMS leading to this high CPU usage?
JVM parameters in the application:
-server
-d64
-Xms5000M
-Xmx5000M
-XX:+DisableExplicitGC
-XX:NewSize=512M
-XX:MaxNewSize=512M
-XX:SurvivorRatio=4
-XX:PermSize=256M
-XX:MaxPermSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=65
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+CMSPermGenSweepingEnabled
-XX:MaxTenuringThreshold=30
I am not able to figure out a solution to this problem. Is there any JVM parameter I need to change?
(concurrent mode failure): 3073996K->3011216K(4718592K), 20.7183280 secs] 3348996K->3011216K(5155520K), [CMS Perm : 262143K->40538K(262144K)], 20.7185010 secs] [Times: user=29.87 sys=0.31, real=20.71 secs]
(concurrent mode failure): 3258153K->3197547K(4718592K), 17.8924530 secs] 3644714K->3197547K(5155520K), [CMS Perm : 262143K->40572K(262144K)], 17.8926620 secs] [Times: user=17.89 sys=0.01, real=17.89 secs]
(concurrent mode failure): 3439590K->3370903K(4718592K), 18.0448510 secs] 3548868K->3370903K(5155520K), [CMS Perm : 262143K->40526K(262144K)], 18.0450480 secs] [Times: user=17.94 sys=0.01, real=18.04 secs]
I suggest you add the java tag, because you have a Java-specific problem with Full GC pauses. It always sounds like:
As the application runs for somedays, at some point of time the cpu
utilization reaches 100%, which leads to slow response from
application.
I hope this article can help you. But first of all, you need to add some GC logging to your Java parameters:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
Also, it would be helpful to print heap statistics:
-XX:+PrintClassHistogramAfterFullGC -XX:+PrintClassHistogramBeforeFullGC
Probably, you have a memory leak also :)
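If a leak is suspected, capturing a heap dump at the first OOM is often the quickest evidence; both flags below are standard HotSpot options, and the path is a placeholder to adapt:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps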

why not Full GC?

Eden is 8 MB, survivor1 and survivor2 are 2 MB in total, and the old area is 10 MB. When object alloc4 is created, the first minor GC is triggered and alloc1/alloc2/alloc3 are moved to the old area. When alloc6 is created, alloc4 is moved to the old area and alloc5 is moved to a survivor space. When alloc7 is created, Eden can't hold it, so it should be moved to the old area; but the old area already holds alloc1/alloc2/alloc3/alloc4 (9 MB) and can't hold alloc7 either, so the old area should trigger a Full GC and reclaim alloc1 and alloc3. But why is the third GC not a Full GC but a minor GC?
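For reference, the heap layout implied by the flags below (-Xmx20M -Xmn10M -XX:SurvivorRatio=8) works out to: Eden 8 MB, two survivor spaces of 1 MB each (only one is usable at a time, which is why the young generation reports 9216K), and a 10 MB tenured generation, matching the heap printout at the end.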
/**
* VM Args:-Xms20M -Xmx20M -Xmn10M -XX:SurvivorRatio=8 -XX:+PrintGCDetails
*
* #author yikebocai#gmail.com
* #since 2013-3-26
*
*/
public class Testjstat {
    private static final int _1MB = 1024 * 1024;

    public static void main(String[] args) throws InterruptedException {
        byte[] alloc1 = new byte[2 * _1MB];
        byte[] alloc2 = new byte[2 * _1MB];
        byte[] alloc3 = new byte[1 * _1MB];
        // first Minor GC
        byte[] alloc4 = new byte[4 * _1MB];
        byte[] alloc5 = new byte[_1MB / 4];
        // second Minor GC
        byte[] alloc6 = new byte[6 * _1MB];
        alloc1 = null;
        alloc3 = null;
        // first Full GC
        byte[] alloc7 = new byte[3 * _1MB];
    }
}
The GC detail is:
[GC [DefNew: 5463K->148K(9216K), 0.0063046 secs] 5463K->5268K(19456K), 0.0063589 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
[GC [DefNew: 4587K->404K(9216K), 0.0046368 secs] 9707K->9620K(19456K), 0.0046822 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
[GC [DefNew: 6548K->6548K(9216K), 0.0000373 secs][Tenured: 9216K->6144K(10240K), 0.0124560 secs] 15764K->12692K(19456K), [Perm : 369K->369K(12288K)], 0.0126052 secs] [Times: user=0.00 sys=0.02, real=0.01 secs]
Heap
def new generation total 9216K, used 6712K [0x322a0000, 0x32ca0000, 0x32ca0000)
eden space 8192K, 81% used [0x322a0000, 0x3292e2a8, 0x32aa0000)
from space 1024K, 0% used [0x32aa0000, 0x32aa0000, 0x32ba0000)
to space 1024K, 0% used [0x32ba0000, 0x32ba0000, 0x32ca0000)
tenured generation total 10240K, used 9216K [0x32ca0000, 0x336a0000, 0x336a0000)
the space 10240K, 90% used [0x32ca0000, 0x335a0030, 0x335a0200, 0x336a0000)
compacting perm gen total 12288K, used 369K [0x336a0000, 0x342a0000, 0x376a0000)
the space 12288K, 3% used [0x336a0000, 0x336fc548, 0x336fc600, 0x342a0000)
ro space 10240K, 51% used [0x376a0000, 0x37bccf58, 0x37bcd000, 0x380a0000)
rw space 12288K, 54% used [0x380a0000, 0x38738f50, 0x38739000, 0x38ca0000)
