I am reading from a Kafka topic which has 5 partitions. Since 5 cores are not sufficient to handle the load, I am repartitioning the input to 30. I have given 30 cores to my Spark job, with 6 cores on each executor. With this setup I was assuming that each executor would get 6 tasks. But more often than not we are seeing that one executor gets 4 tasks while others get 7 tasks, which skews the processing time of our job.
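For context, here is a minimal sketch of the setup described above; the broker, topic, group id, batch interval, and the processRecord step are placeholders rather than the actual job:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer

// Sketch only: names and values below are stand-ins, not the real job.
object RepartitionJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-repartition-30")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-group"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
    )

    // 5 Kafka partitions -> 30 Spark partitions, so all 30 cores can work in parallel.
    stream.map(_.value).repartition(30).foreachRDD { rdd =>
      rdd.foreachPartition(records => records.foreach(processRecord))
    }

    ssc.start()
    ssc.awaitTermination()
  }

  def processRecord(value: String): Unit = () // placeholder for the real processing
}
```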
Can someone help me understand why all the executors do not get an equal number of tasks? Here are the executor metrics after the job has run for 12 hours:
| Address | Status | RDD Blocks | Storage Memory | Disk Used | Cores | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task Time (GC Time) | Input | Shuffle Read | Shuffle Write |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ip1:36759 | Active | 7 | 1.6 MB / 144.7 MB | 0.0 B | 6 | 6 | 0 | 442506 | 442512 | 35.9 h (26 min) | 42.1 GB | 25.9 GB | 24.7 GB |
| ip2:36689 | Active | 0 | 0.0 B / 128 MB | 0.0 B | 0 | 0 | 0 | 0 | 0 | 0 ms (0 ms) | 0.0 B | 0.0 B | 0.0 B |
| ip5:44481 | Active | 7 | 1.6 MB / 144.7 MB | 0.0 B | 6 | 6 | 0 | 399948 | 399954 | 29.0 h (20 min) | 37.3 GB | 22.8 GB | 24.7 GB |
| ip1:33187 | Active | 7 | 1.5 MB / 144.7 MB | 0.0 B | 6 | 5 | 0 | 445720 | 445725 | 35.9 h (26 min) | 42.4 GB | 26 GB | 24.7 GB |
| ip3:34935 | Active | 7 | 1.6 MB / 144.7 MB | 0.0 B | 6 | 6 | 0 | 427950 | 427956 | 33.8 h (23 min) | 40.5 GB | 24.8 GB | 24.7 GB |
| ip4:38851 | Active | 7 | 1.7 MB / 144.7 MB | 0.0 B | 6 | 6 | 0 | 410276 | 410282 | 31.6 h (24 min) | 39 GB | 23.9 GB | 24.7 GB |
As you can see, there is a skew in the number of tasks completed by ip5:44481. I don't see abnormal GC activity either.
What metrics should I be looking at to understand this skew?
UPDATE
Upon further debugging I can see that the input partitions hold unequal amounts of data, while the tasks after repartitioning are each given approximately the same number of records.
| Executor ID | Address | Task Time | Total Tasks | Failed Tasks | Killed Tasks | Succeeded Tasks | Shuffle Read Size / Records | Blacklisted |
|---|---|---|---|---|---|---|---|---|
| 0 | ip3:37049 | 0.8 s | 6 | 0 | 0 | 6 | 600.9 KB / 272 | FALSE |
| 1 | ip1:37875 | 0.6 s | 6 | 0 | 0 | 6 | 612.2 KB / 273 | FALSE |
| 2 | ip3:41739 | 0.7 s | 5 | 0 | 0 | 5 | 529.0 KB / 226 | FALSE |
| 3 | ip2:38269 | 0.5 s | 6 | 0 | 0 | 6 | 623.4 KB / 272 | FALSE |
| 4 | ip1:40083 | 0.6 s | 7 | 0 | 0 | 7 | 726.7 KB / 318 | FALSE |
These are the stats of the stage just after repartitioning. We can see that the number of tasks is proportional to the number of records. As a next step I am trying to see how the partitioning function works (a quick check is sketched below).
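A quick way to check that (a sketch; `rdd` stands in for the output of the repartition) is to count how many records land in each partition:

```scala
import org.apache.spark.rdd.RDD

// Sketch: print the record count per partition of the repartitioned RDD,
// so an uneven partitioner would show up directly.
def printPartitionSizes[T](rdd: RDD[T]): Unit = {
  val counts = rdd
    .mapPartitionsWithIndex { (partitionId, records) =>
      Iterator((partitionId, records.size))
    }
    .collect()
    .sortBy(_._1)

  counts.foreach { case (partitionId, count) =>
    println(s"partition $partitionId -> $count records")
  }
}
```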
Update 2:
The only explanation I have come across is that Spark uses round-robin partitioning, and it is executed independently on each source partition. For example, if there are 5 records on node1 and 7 records on node2, node1's round robin will distribute approximately 3 records to node1 and approximately 2 records to node2, while node2's round robin will distribute approximately 4 records to node1 and approximately 3 records to node2. So there is the possibility of ending up with 7 records on node1 and 5 records on node2, depending on the ordering of the nodes as interpreted within the framework code on each individual node. (source)
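To make that concrete, here is a tiny standalone sketch (illustrative only, not Spark's actual implementation) of two source partitions each running their own round robin over the same two nodes and producing the 7/5 split described above:

```scala
// Illustrative sketch of independent per-partition round robin; not Spark code.
object RoundRobinSketch {
  def distribute(records: Int, targets: Seq[String], startAt: Int): Map[String, Int] =
    (0 until records)
      .map(i => targets((startAt + i) % targets.size)) // round-robin assignment
      .groupBy(identity)
      .map { case (target, assigned) => target -> assigned.size }

  def main(args: Array[String]): Unit = {
    val targets = Seq("node1", "node2")
    // node1 holds 5 records, node2 holds 7; both happen to start their round robin at node1.
    val fromNode1 = distribute(5, targets, startAt = 0) // node1 -> 3, node2 -> 2
    val fromNode2 = distribute(7, targets, startAt = 0) // node1 -> 4, node2 -> 3
    val total = (fromNode1.keySet ++ fromNode2.keySet)
      .map(t => t -> (fromNode1.getOrElse(t, 0) + fromNode2.getOrElse(t, 0)))
      .toMap
    println(total) // Map(node1 -> 7, node2 -> 5): uneven despite round robin
  }
}
```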
NOTE:
If you notice, the best performing executors are on the same IP. Is that because, after shuffling, transferring data on the same host is faster compared to another IP?
Based on the above data we can see that repartition is working fine, i.e. it assigns an approximately equal number of records to each of the 30 partitions, but the question remains why some executors get more partitions than others.
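One factor I am considering (an assumption on my part, not something the metrics above confirm) is task locality, which also ties in with the note about the same IP: the scheduler prefers to place a task on an executor that already holds its data and will wait up to spark.locality.wait before falling back to another executor, which can leave some executors with extra tasks. A quick experiment is to reduce that wait and see whether the task counts even out:

```scala
import org.apache.spark.sql.SparkSession

// Hedged experiment, not a confirmed fix: with spark.locality.wait set to 0,
// the scheduler hands a task to any free core immediately instead of waiting
// for the "preferred" (data-local) executor. If the 30 tasks then spread
// evenly (6 per executor), locality scheduling was behind the imbalance.
val spark = SparkSession.builder()
  .appName("locality-wait-experiment")
  .config("spark.locality.wait", "0s") // default is 3s
  .getOrCreate()
```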
I have a Solaris box and I'm trying to figure out whether it's running out of memory or whether it's stable.
Below is the output of vmstat:
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr vc vc vc vc in sy cs us sy id
1 0 0 11426696 4603520 613 1477 449 6 6 0 0 78 22 28 29 8970 37714 22961 43 6 51
4 0 0 4975280 0 1747 3487 805 0 0 0 0 233 41 33 44 9558 53713 15845 74 8 18
4 0 0 4936944 0 933 1837 0 0 0 0 0 56 28 12 39 9317 46898 14648 82 7 11
5 0 0 4943080 0 1056 2806 805 0 0 0 0 103 21 18 18 9286 46900 14866 78 8 14
5 0 0 4942264 0 1088 2173 804 6 6 0 0 109 8 40 31 9927 56484 16495 84 8 8
3 0 0 4942520 0 308 1018 1756 3 3 0 0 166 87 29 44 10638 64146 21413 83 9 8
0 0 0 4942512 0 156 326 1740 0 0 0 0 370 12 33 52 11554 40375 21897 75 9 16
2 0 0 4947384 0 294 560 845 0 0 0 0 121 18 23 20 9445 52382 17016 77 6 17
I can see the free column shows 0; however, the sr column also shows 0.
And the output from the top command doesn't show how much free memory is available. Swap shows 0.0%.
load averages: 11.4, 9.12, 9.24;
9021 processes: 9018 sleeping, 1 running, 2 on cpu
CPU states: 0.0% idle, 71.4% user, 28.6% kernel, 0.0% iowait, 0.0% swap
Memory: 24G phys mem, 16G total swap, 13G free swap
Am I running out of RAM?
Please suggest how to interpret this data. Do I need to increase my physical memory?
I'd appreciate some insights.
From the Solaris 11.4 vmstat man page, there's one important thing to note:
Without options, vmstat displays a one-line summary of the virtual memory activity since the system was booted.
That also applies to the first line of output from Solaris vmstat: it's a summary of all activity since the system was booted.
A good description of the output fields is found in the EXAMPLES section of the Solaris man vmstat page:
Examples
Example 1 Using vmstat
The following command displays a summary of what the system is doing
every five seconds.
example% vmstat 5
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id
0 0 0 11456 4120 1 41 19 1 3 0 2 0 4 0 0 48 112 130 4 14 82
0 0 1 10132 4280 0 4 44 0 0 0 0 0 23 0 0 211 230 144 3 35 62
0 0 1 10132 4616 0 0 20 0 0 0 0 0 19 0 0 150 172 146 3 33 64
0 0 1 10132 5292 0 0 9 0 0 0 0 0 21 0 0 165 105 130 1 21 78
1 1 1 10132 5496 0 0 5 0 0 0 0 0 23 0 0 183 92 134 1 20 79
1 0 1 10132 5564 0 0 25 0 0 0 0 0 18 0 0 131 231 116 4 34 62
1 0 1 10124 5412 0 0 37 0 0 0 0 0 22 0 0 166 179 118 1 33 67
1 0 1 10124 5236 0 0 24 0 0 0 0 0 14 0 0 109 243 113 4 56 39
example%
The fields of vmstat's display are
kthr
Report the number of kernel threads in each of the three following
states:
r
the number of kernel threads in run queue
b
the number of blocked kernel threads that are waiting for
resources I/O, paging, and so forth
w
the number of swapped out lightweight processes (LWPs) that
are waiting for processing resources to finish.
memory
Report on usage of virtual and real memory.
swap
available swap space (Kbytes)
free
size of the free list (Kbytes)
page
Report information about page faults and paging activity. The
information on each of the following activities is given in units per
second.
re
page reclaims — but see the –S option for how this field is modified.
mf
minor faults — but see the –S option for how this field is modified.
pi
kilobytes paged in
po
kilobytes paged out
fr
kilobytes freed
de
anticipated short-term memory shortfall (Kbytes)
sr
pages scanned by clock algorithm
When executed in a zone and if the pools facility is active, all of
the above (except for ‘de’) only report activity on the processors in
the processor set of the zone's pool.
disk
Report the number of disk operations per second. There are slots for
up to four disks, labeled with a single letter and number. The letter
indicates the type of disk (s = SCSI, i = IPI, and so forth); the
number is the logical unit number.
faults
Report the trap/interrupt rates (per second).
in
interrupts
sy
system calls
cs
CPU context switches
When executed in a zone and if the pools facility is active, all of
the above only report activity on the processors in the processor set
of the zone's pool.
cpu
Give a breakdown of percentage usage of CPU time. On MP systems, this
is an average across all processors.
us
user time
sy
system time
id
idle time
When executed in a zone and if the pools facility is active, all of
the above only report activity on the processors in the processor set
of the zone's pool.
This can help you: https://www.howtogeek.com/424334/how-to-use-the-vmstat-command-on-linux/. It has an explanation of those abbreviations.
Memory
swpd: the amount of virtual memory used. In other words, how much memory has been swapped out.
free: the amount of idle (currently unused) memory.
buff: the amount of memory used as buffers.
cache: the amount of memory used as cache.
Swap
si: Amount of virtual memory swapped in from swap space.
so: Amount of virtual memory swapped out to swap space.
IO
bi: Blocks received from a block device. The number of data blocks used to swap virtual memory back into RAM.
bo: Blocks sent to a block device. The number of data blocks used to swap virtual memory out of RAM and into swap space.
System
in: The number of interrupts per second, including the clock.
cs: The number of context switches per second. A context switch is when the kernel swaps from system mode processing into user mode processing.
"0" is not a valid free memory value.
By design, Solaris always makes sure a minimal amount of free memory is available. The fact that the sr column is also equal to zero suggests there is no memory shortage. In any case, you wouldn't have been able to run vmstat or top in the first place with such an extreme RAM shortage.
You should investigate further to understand why the free memory is reported as zero. mdb's ::memstat command would be a good start:
# echo "::memstat" | mdb -k
I am using Cassandra v2.1.13 in a 10-node cluster with replication factor = 2 and LeveledCompactionStrategy. The nodes are c4.4xlarge instances with a 10 GB heap.
8 nodes in the cluster are performing fine, but there is an issue with 2 particular nodes. These nodes are constantly giving very poor read latency and have a high dropped-read count, and there is continuously high OS load on these 2 nodes.
Below are the nodetool command results from one of these nodes:
status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.0.23.37 257.31 GB 256 19.0% 3e9ee62e-70a2-4b2e-ba10-290a62cd055b 1
UN 10.0.53.69 300.24 GB 256 20.5% 48988162-69d6-4698-9afa-799ef4be7bbc 2
UN 10.0.23.133 342.37 GB 256 21.1% 30431a62-0cf6-4c82-8af1-e9ba0025eba6 1
UN 10.0.53.7 348.52 GB 256 21.4% 5fcdeb25-e1e5-47f6-af7f-7bea825ab3c0 2
UN 10.0.53.88 292.59 GB 256 19.5% c77904bc-10a8-49e0-b6fa-8fe8126e064c 2
UN 10.0.53.250 272.76 GB 256 20.6% ecf417f2-2e96-4b9e-bb15-06eaf948cefa 2
UN 10.0.23.75 271.24 GB 256 20.8% d8b0ab1b-65ab-46cd-b7e4-3fb3861ffb23 1
UN 10.0.23.253 302.9 GB 256 21.0% 4bb6408a-9aa0-42da-96f7-dbe0dad757bc 1
UN 10.0.23.238 326.35 GB 256 18.2% 55e33a97-e5ca-4c48-a530-a0ff6fa8edde 1
UN 10.0.53.222 247.4 GB 256 18.0% c3a6e4c2-7ab6-4f3a-a444-8d6dff2beb43 2
cfstats
Keyspace: key_space_name
Read Count: 63815118
Read Latency: 88.71912845022085 ms.
Write Count: 40802728
Write Latency: 1.1299861338192878 ms.
Pending Flushes: 0
Table: table1
SSTable count: 1269
SSTables in each level: [1, 10, 103/100, 1023/1000, 131, 0, 0, 0, 0]
Space used (live): 274263401275
Space used (total): 274263401275
Space used by snapshots (total): 0
Off heap memory used (total): 1776960519
SSTable Compression Ratio: 0.3146938954387242
Number of keys (estimate): 1472406972
Memtable cell count: 3840240
Memtable data size: 96356032
Memtable off heap memory used: 169478569
Memtable switch count: 149
Local read count: 47459095
Local read latency: 0.328 ms
Local write count: 14501016
Local write latency: 0.695 ms
Pending flushes: 0
Bloom filter false positives: 186
Bloom filter false ratio: 0.00000
Bloom filter space used: 1032396536
Bloom filter off heap memory used: 1032386384
Index summary off heap memory used: 495040742
Compression metadata off heap memory used: 80054824
Compacted partition minimum bytes: 216
Compacted partition maximum bytes: 3973
Compacted partition mean bytes: 465
Average live cells per slice (last five minutes): 0.1710125211397823
Maximum live cells per slice (last five minutes): 1.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
Table: table2
SSTable count: 93
SSTables in each level: [1, 10, 82, 0, 0, 0, 0, 0, 0]
Space used (live): 18134115541
Space used (total): 18134115541
Space used by snapshots (total): 0
Off heap memory used (total): 639297085
SSTable Compression Ratio: 0.2889927549599339
Number of keys (estimate): 102804187
Memtable cell count: 409595
Memtable data size: 492365311
Memtable off heap memory used: 529339207
Memtable switch count: 433
Local read count: 16357463
Local read latency: 345.194 ms
Local write count: 26302779
Local write latency: 1.370 ms
Pending flushes: 0
Bloom filter false positives: 4
Bloom filter false ratio: 0.00000
Bloom filter space used: 73133360
Bloom filter off heap memory used: 73132616
Index summary off heap memory used: 30985070
Compression metadata off heap memory used: 5840192
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 24601
Compacted partition mean bytes: 474
Average live cells per slice (last five minutes): 0.9915609172937249
Maximum live cells per slice (last five minutes): 1.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
tpstats
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 2 0 43617272 0 0
ReadStage 45 189 64039921 0 0
RequestResponseStage 0 0 37790267 0 0
ReadRepairStage 0 0 3590974 0 0
CounterMutationStage 0 0 0 0 0
MiscStage 0 0 0 0 0
HintedHandoff 0 0 18 0 0
GossipStage 0 0 457469 0 0
CacheCleanupExecutor 0 0 0 0 0
InternalResponseStage 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
CompactionExecutor 1 1 60965 0 0
ValidationExecutor 0 0 646 0 0
MigrationStage 0 0 0 0 0
AntiEntropyStage 0 0 1938 0 0
PendingRangeCalculator 0 0 17 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 1997 0 0
MemtablePostFlush 0 0 4884 0 0
MemtableReclaimMemory 0 0 1997 0 0
Native-Transport-Requests 11 0 61321377 0 0
Message type Dropped
READ 56788
RANGE_SLICE 0
_TRACE 0
MUTATION 182
COUNTER_MUTATION 0
BINARY 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 1
netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 2910144
Mismatch (Blocking): 0
Mismatch (Background): 229431
Pool Name Active Pending Completed
Commands n/a 0 37851777
Responses n/a 0 101128958
compactionstats
pending tasks: 0
I verified that there is no continuous compaction running.
Upon taking a jstack on both of the nodes, there were threads that were continuously executing the following stack in what looked like an infinite loop. My guess is that the high OS load is caused by these threads being stuck in a loop, leading to slow reads.
SharedPool-Worker-7 - priority:5 - threadId:0x00002bb0356a0150 - nativeId:0x158ef - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.cassandra.utils.memory.HeapAllocator.allocate(HeapAllocator.java:34)
at org.apache.cassandra.utils.memory.AbstractAllocator.clone(AbstractAllocator.java:34)
at org.apache.cassandra.db.NativeCell.localCopy(NativeCell.java:58)
at org.apache.cassandra.db.CollationController$2.apply(CollationController.java:223)
at org.apache.cassandra.db.CollationController$2.apply(CollationController.java:220)
at com.google.common.collect.Iterators$8.transform(Iterators.java:794)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:175)
at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:156)
at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:264)
at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:108)
at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:82)
at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:69)
at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:314)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:2001)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1844)
at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:353)
at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:85)
at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:47)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164)
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
at java.lang.Thread.run(Thread.java:748)
We tried increasing the instance size of the affected nodes, suspecting that they might be blocked on I/O operations, but that did not help.
Can someone please help me figure out what is wrong with these 2 nodes in the cluster?
I have an Azure VM with 2 cores. From my understanding, the CPU % returned by docker stats can be greater than 100% if multiple cores are used, so it should max out at 200% for this VM. However, I get results like this, with CPU % greater than 1000%:
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
545d4c69028f 3.54% 94.39 MiB / 6.803 GiB 1.35% 3.36 MB / 1.442 MB 1.565 MB / 5.673 MB 6
008893e3f70c 625.00% 191.3 MiB / 6.803 GiB 2.75% 0 B / 0 B 0 B / 24.58 kB 35
f49c94dc4567 0.10% 46.85 MiB / 6.803 GiB 0.67% 2.614 MB / 5.01 MB 61.44 kB / 0 B 31
08415d81c355 0.00% 28.76 MiB / 6.803 GiB 0.41% 619.1 kB / 3.701 MB 0 B / 0 B 11
03f54d35a5f8 1.04% 136.5 MiB / 6.803 GiB 1.96% 83.94 MB / 7.721 MB 0 B / 0 B 22
f92faa7321d8 0.15% 19.29 MiB / 6.803 GiB 0.28% 552.5 kB / 758.6 kB 0 B / 2.798 MB 7
2f4a27cc3e44 0.07% 303.8 MiB / 6.803 GiB 4.36% 32.52 MB / 20.27 MB 2.195 MB / 0 B 11
ac96bc45044a 0.00% 19.34 MiB / 6.803 GiB 0.28% 37.28 kB / 12.76 kB 0 B / 3.633 MB 7
7c1a45e92f52 2.20% 356.9 MiB / 6.803 GiB 5.12% 86.36 MB / 156.2 MB 806.9 kB / 0 B 16
0bc4f319b721 14.98% 101.8 MiB / 6.803 GiB 1.46% 138.1 MB / 64.33 MB 0 B / 73.74 MB 75
66aa24598d27 2269.46% 1.269 GiB / 6.803 GiB 18.65% 1.102 GB / 256.4 MB 14.34 MB / 3.412 MB 50
I can verify there are only two cores:
$ grep -c ^processor /proc/cpuinfo
2
The output of lshw -short is also confusing to me:
H/W path Device Class Description
=====================================================
system Virtual Machine
/0 bus Virtual Machine
/0/0 memory 64KiB BIOS
/0/5 processor Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
/0/6 processor Xeon (None)
/0/7 processor (None)
/0/8 processor (None)
/0/9 processor (None)
/0/a processor (None)
/0/b processor (None)
/0/c processor (None)
/0/d processor (None)
/0/e processor (None)
/0/f processor (None)
/0/10 processor (None)
...
…with well over 50 processors listed.
For your first question, I would suggest you submit an issue on this page.
The output of lshw -short is also confusing to me:
If you omit the "-short" parameter, you will find that all of the "processor (None)" entries are in the DISABLED state.
I have a Spark SQL query that looks like this:
val stats = readings                                  // readings assumed to be aliased as "r" elsewhere
  .join(stations, $"r.stationId" === $"i.stationId")  // stations assumed to be aliased as "i"
  .groupBy($"i.country", $"r.date")
  .agg(min($"r.temp").as("min"), max($"r.temp").as("max"), avg($"r.temp").as("avg"))
  .as[CountryTemperatureDistribution]
When I run this over my dataset, some of the tasks in the longest-running stage take far longer than others. For example, the summary in the Spark UI is:
| Index | ID | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records |
|---|---|---|---|---|---|---|---|---|
| 2 | 5 | 0 / 10.1.0.6 | 12:36:38 | 12 s | 1 s | 14.4 MB / 9287283 | 2 s | 3.7 MB / 138847 |
| 3 | 6 | 2 / 10.1.0.4 | 12:36:38 | 17 s | 2 s | 0.0 B / 9273044 | 0.4 s | 3.8 MB / 141888 |
| 5 | 8 | 0 / 10.1.0.6 | 12:36:38 | 22 s | 1 s | 14.5 MB / 9064853 | 1 s | 5.6 MB / 206393 |
| 0 | 3 | 2 / 10.1.0.4 | 12:36:38 | 23 s | 2 s | 0.0 B / 9290667 | 0.3 s | 4.1 MB / 157233 |
| 6 | 9 | 2 / 10.1.0.4 | 12:36:38 | 23 s | 2 s | 0.0 B / 9233245 | 0.2 s | 4.1 MB / 158441 |
| 1 | 4 | 1 / 10.1.0.5 | 12:36:38 | 24 s | 3 s | 14.5 MB / 9289969 | 1 s | 3.9 MB / 149370 |
| 4 | 7 | 1 / 10.1.0.5 | 12:36:38 | 24 s | 3 s | 14.8 MB / 9250565 | 1 s | 4.0 MB / 154438 |
| 10 | 13 | 1 / 10.1.0.5 | 12:36:38 | 1.5 min | 3 s | 14.1 MB / 8734757 | 0.3 s | 6.3 MB / 234829 |
| 12 | 15 | 0 / 10.1.0.6 | 12:36:50 | 2.3 min | 88 ms | 7.3 MB / 4338998 | 0.5 s | 4.8 MB / 177381 |
| 9 | 12 | 2 / 10.1.0.4 | 12:36:38 | 3.1 min | 2 s | 0.0 B / 8242837 | 0.1 s | 8.2 MB / 301212 |
| 8 | 11 | 0 / 10.1.0.6 | 12:36:38 | 7.3 min | 2 s | 13.1 MB / 8346974 | 0.4 s | 7.6 MB / 277488 |
| 11 | 14 | 0 / 10.1.0.6 | 12:36:38 | 7.6 min | 2 s | 13.0 MB / 7950142 | 0.1 s | 7.3 MB / 262211 |
| 7 | 10 | 1 / 10.1.0.5 | 12:36:38 | 8.4 min | 3 s | 15.2 MB / 9347698 | 0.1 s | 8.2 MB / 292230 |
Based on this, some tasks complete in seconds, while others take several minutes. I'm trying to rule out data skew (my group by shouldn't produce skewed data), but if this were data skew, presumably the slower tasks would have a much larger input size than the faster ones (which they don't).
Another oddity is that several tasks (even ones which take several minutes to complete) have an input size of 0 B but a non-zero input record count?!
TL;DR: why do some of my tasks take much longer to complete than others?
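To rule skew in or out more directly, one option is a quick key-distribution check. This is only a sketch reusing the names from the query above, and it assumes `import spark.implicits._` is in scope and that readings and stations were aliased as "r" and "i"; if a handful of keys dominate these counts, the tasks that receive them will be the slow ones:

```scala
import org.apache.spark.sql.functions.{count, desc}

// Rows per join key: a few very large stationIds would explain a few very slow tasks.
readings
  .groupBy($"r.stationId")
  .agg(count("*").as("rows"))
  .orderBy(desc("rows"))
  .show(20, truncate = false)

// Rows per output group: a few very large (country, date) groups would do the same.
readings
  .join(stations, $"r.stationId" === $"i.stationId")
  .groupBy($"i.country", $"r.date")
  .agg(count("*").as("rows"))
  .orderBy(desc("rows"))
  .show(20, truncate = false)
```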
I am trying to configure and benchmark my AWS EC2 instances for a Cassandra deployment with DataStax Community Edition. I'm working with 1 cluster so far, and I'm having issues with horizontal scaling.
I'm running the cassandra-stress tool to stress the nodes and I'm not seeing horizontal scaling. My command is run from an EC2 instance that is on the same network as the nodes but is not one of the nodes (i.e. I'm not using one of the nodes to launch the command).
I ran the following:
cassandra-stress write n=1000000 cl=one -mode native cql3 -schema keyspace="keyspace1" -pop seq=1..1000000 -node ip1,ip2
I started with 2 nodes, then 3, and then 6. But the numbers don't show what Cassandra is supposed to do: adding more nodes to a cluster should speed up reads/writes.
Results:

| Metric | 2 Nodes: 1M | 3 Nodes: 1M | 3 Nodes: 2M | 6 Nodes: 1M | 6 Nodes: 2M | 6 Nodes: 6M | 6 Nodes: 10M |
|---|---|---|---|---|---|---|---|
| op rate | 6858 | 6049 | 6804 | 7711 | 7257 | 7531 | 8081 |
| partition rate | 6858 | 6049 | 6804 | 7711 | 7257 | 7531 | 8081 |
| row rate | 6858 | 6049 | 6804 | 7711 | 7257 | 7531 | 8081 |
| latency mean | 29.1 | 33 | 29.3 | 25.9 | 27.5 | 26.5 | 24.7 |
| latency median | 24.9 | 32.1 | 24 | 22.6 | 23.1 | 21.8 | 21.5 |
| latency 95th percentile | 57.9 | 73.3 | 62 | 50 | 56.2 | 52.1 | 40.2 |
| latency 99th percentile | 76 | 92.2 | 77.4 | 65.3 | 69.1 | 61.8 | 46.4 |
| latency 99.9th percentile | 87 | 103.4 | 83.5 | 76.2 | 75.7 | 64.9 | 48.1 |
| latency max | 561.1 | 587.1 | 1075 | 503.1 | 521.7 | 1662.3 | 590.3 |
| total gc count | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| total gc mb | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| total gc time (s) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| avg gc time (ms) | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| stdev gc time (ms) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total operation time | 0:02:25 | 0:02:45 | 0:04:53 | 0:02:09 | 00:04.35 | 0:13:16 | 0:20:37 |
Each run used the default keyspace1 that was provided.
I've tested 3 nodes at 1M and 2M iterations; with 6 nodes I've tried 1M, 2M, 6M, and 10M. As I increase the iteration count, the op rate only increases marginally.
Am I doing something wrong, or do I have Cassandra backwards? Right now RF = 1, as I don't want to add latency for replication. I just want to see the horizontal scaling in the long term, which I'm not seeing.
Help?