How to monitor the I/O queue depth

I am benchmarking databases on an NVMe SSD. I want to monitor the number of I/O requests in the device queues over time, to see whether the databases fully take advantage of the queues.
I have tried tools like iostat, but the avgqu-sz field is always zero. I think this may be because an NVMe SSD uses a completely different storage stack from conventional devices (e.g., SATA SSDs).

Solution:
cd /sys/kernel/debug/tracing/events/nvme/nvme_sq
# filter by disk name:
echo 'disk=="nvme0n1"' > filter
# enable the event:
echo 1 > enable
# check results from trace_pipe:
cat /sys/kernel/debug/tracing/trace_pipe
I suggest also enabling /sys/kernel/debug/tracing/events/nvme/nvme_setup_cmd; then you can get a rough idea of what the nvme driver is doing.
<idle>-0 [002] d.h. 2558.073405: nvme_sq: nvme0: disk=nvme0n1, qid=3, head=76, tail=76
systemd-udevd-3805 [002] .... 2558.073454: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=3, cmdid=48, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=104856608, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
<idle>-0 [002] d.h. 2558.073664: nvme_sq: nvme0: disk=nvme0n1, qid=3, head=77, tail=77
systemd-udevd-3805 [002] .... 2558.073704: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=3, cmdid=49, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=104856648, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
<idle>-0 [002] d.h. 2558.073899: nvme_sq: nvme0: disk=nvme0n1, qid=3, head=78, tail=78
systemd-udevd-3805 [002] .... 2558.073938: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=3, cmdid=50, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=104854512, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
<idle>-0 [002] d.h. 2558.074134: nvme_sq: nvme0: disk=nvme0n1, qid=3, head=79, tail=79
The explanation of each field in this output can be found here.
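If you want a queue-depth-over-time signal rather than raw events, a small script can follow trace_pipe and compute tail - head per queue from the nvme_sq lines above. This is only a rough sketch under a few assumptions: the line format matches the output shown, and QUEUE_SIZE (used only to handle head/tail wraparound) must be set to your device's actual submission queue size.
#!/usr/bin/env python3
# Approximate per-queue submission-queue occupancy from nvme_sq trace lines.
# QUEUE_SIZE is an assumption; adjust it to your device/driver setting.
import re
import sys

QUEUE_SIZE = 1024
PATTERN = re.compile(
    r"\s(\d+\.\d+): nvme_sq: .*disk=(\S+), qid=(\d+), head=(\d+), tail=(\d+)"
)

for line in sys.stdin:
    m = PATTERN.search(line)
    if not m:
        continue
    ts, disk, qid, head, tail = m.groups()
    depth = (int(tail) - int(head)) % QUEUE_SIZE  # outstanding SQ entries (approx.)
    print(f"{ts} {disk} qid={qid} depth={depth}")
Run it as, for example, cat /sys/kernel/debug/tracing/trace_pipe | python3 sq_depth.py (the script name is just an example).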

Related

Why would a `#NP` fault expecting an IDT entry at index 302 even be possible?

I've been writing a kernel from scratch in Rust for some time now, and it has been open-sourced since August while I attempt to fix some problems related to an AHCI driver. One problem that I can't seem to find a solution to at all is this:
The IDT is only supposed to be 256 entries long. Why, then, is a handler function expected at entry 302, which is a higher index than is legally possible? And how does one go about mapping this properly?
Running QEMU with -d int produces this interrupt information:
100: v=97 e=0000 i=0 cpl=0 IP=0008:0000008000021b3c pc=0000008000021b3c SP=0010:fffff00007ffe8f0 env->regs[R_EAX]=000001807eb4a3e0
RAX=000001807eb4a3e0 RBX=0000000000000020 RCX=000fffffffffffff RDX=0000000000000000
RSI=000000000000001f RDI=00000180c1085100 RBP=0000000000000000 RSP=fffff00007ffe8f0
R8 =000000800008bcc8 R9 =0000000000000003 R10=000000007bf36000 R11=0000000000000001
R12=000000000a204000 R13=0000200000c00000 R14=00000180c1085100 R15=fffff00007ffec01
RIP=0000008000021b3c RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
CS =0008 0000000000000000 ffffffff 00af9b00 DPL=0 CS64 [-RA]
SS =0010 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0010 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
FS =0020 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
GS =0028 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0030 00000080000aed3c 00000067 00008900 DPL=0 TSS64-avl
GDT= 00000080000aedb8 0000003f
IDT= 00000080000aee30 00000fff
CR0=80010033 CR2=0000000000000000 CR3=0000000000002000 CR4=00000668
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=0000000000000020 CCD=0000000000000000 CCO=LOGICL
EFER=0000000000000d00
check_exception old: 0xffffffff new 0xb
101: v=0b e=0972 i=0 cpl=0 IP=0008:0000008000021b3c pc=0000008000021b3c SP=0010:fffff00007ffe8f0 env->regs[R_EAX]=000001807eb4a3e0
RAX=000001807eb4a3e0 RBX=0000000000000020 RCX=000fffffffffffff RDX=0000000000000000
RSI=000000000000001f RDI=00000180c1085100 RBP=0000000000000000 RSP=fffff00007ffe8f0
R8 =000000800008bcc8 R9 =0000000000000003 R10=000000007bf36000 R11=0000000000000001
R12=000000000a204000 R13=0000200000c00000 R14=00000180c1085100 R15=fffff00007ffec01
RIP=0000008000021b3c RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0018 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
CS =0008 0000000000000000 ffffffff 00af9b00 DPL=0 CS64 [-RA]
SS =0010 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0010 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
FS =0020 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
GS =0028 0000000000000000 ffffffff 00cf9300 DPL=0 DS [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0030 00000080000aed3c 00000067 00008900 DPL=0 TSS64-avl
GDT= 00000080000aedb8 0000003f
IDT= 00000080000aee30 00000fff
CR0=80010033 CR2=0000000000000000 CR3=0000000000002000 CR4=00000668
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=0000000000000020 CCD=0000000000000000 CCO=LOGICL
EFER=0000000000000d00
After a lengthy discussion in the comments on the OP, I finally figured out what the problem was: the interrupt that caused the fault was vector 0x97 (151). The bogus index 302 comes from the #NP error code 0x0972, whose IDT bit is set and whose upper bits (0x972 >> 3) decode to descriptor index 302. Re-indexing the AHCI interrupt handler to vector 0x97 in the IDT solved the problem.
@MichaelPetch has filed a bug report against QEMU: QEMU's software emulation encodes the IDT descriptor index incorrectly in the error code when running in long mode. QEMU does work correctly when using the -enable-kvm option.
As of February 6th, 2023, a patch for this QEMU bug has been committed and accepted, and the bug has been marked closed.
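For reference, the selector error code format makes it easy to check this by hand. A minimal sketch (bit 0 = EXT, bit 1 = IDT, bit 2 = TI, remaining bits = descriptor index):
# Decode an x86 selector/IDT error code such as the e=0972 above.
def decode_error_code(e: int) -> dict:
    return {
        "ext": bool(e & 1),   # caused by an external event
        "idt": bool(e & 2),   # index refers to the IDT
        "ti": bool(e & 4),    # GDT/LDT selector (only meaningful when idt is False)
        "index": e >> 3,      # descriptor index
    }

print(decode_error_code(0x0972))
# {'ext': False, 'idt': True, 'ti': False, 'index': 302}
# 302 is not the actual vector 0x97 (151), consistent with the mis-encoding described above.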

Spark 3.0 UTC to AKST conversion fails with ZoneRulesException: Unknown time-zone ID

I am not able to convert timestamps from UTC to the AKST timezone in Spark 3.0. The same works in Spark 2.4, and all other conversions work (EST, PST, MST, etc.).
I'd appreciate any input on how to fix this error.
The command below:
spark.sql("select from_utc_timestamp('2020-10-01 11:12:30', 'AKST')").show
returns the error:
java.time.zone.ZoneRulesException: Unknown time-zone ID: AKST
Detailed log:
java.time.zone.ZoneRulesException: Unknown time-zone ID: AKST
at java.time.zone.ZoneRulesProvider.getProvider(ZoneRulesProvider.java:272)
at java.time.zone.ZoneRulesProvider.getRules(ZoneRulesProvider.java:227)
at java.time.ZoneRegion.ofId(ZoneRegion.java:120)
at java.time.ZoneId.of(ZoneId.java:411)
at java.time.ZoneId.of(ZoneId.java:359)
at java.time.ZoneId.of(ZoneId.java:315)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.getZoneId(DateTimeUtils.scala:62)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.fromUTCTime(DateTimeUtils.scala:833)
at org.apache.spark.sql.catalyst.expressions.FromUTCTimestamp.nullSafeEval(datetimeExpressions.scala:1299)
at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:552)
at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:457)
at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:52)
at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:45)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:321)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:321)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.applyFunctionIfChanged$1(TreeNode.scala:380)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:416)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:248)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:414)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:362)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDown$1(QueryPlan.scala:96)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:118)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:118)
at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:129)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:134)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.immutable.List.map(List.scala:298)
at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:134)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:139)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:248)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:139)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:96)
at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1.applyOrElse(expressions.scala:45)
at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1.applyOrElse(expressions.scala:44)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:321)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:321)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.applyFunctionIfChanged$1(TreeNode.scala:380)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:416)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:248)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:414)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:362)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.applyFunctionIfChanged$1(TreeNode.scala:380)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:416)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:248)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:414)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:362)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:310)
at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.apply(expressions.scala:44)
at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.apply(expressions.scala:43)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149)
at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:89)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:146)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:138)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:138)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:116)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:98)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:116)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:82)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:121)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:153)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:153)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:82)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:79)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$writePlans$4(QueryExecution.scala:217)
at org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:381)
at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$writePlans(QueryExecution.scala:217)
at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:227)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:96)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:207)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:88)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3653)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2737)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2944)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:301)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:338)
at org.apache.spark.sql.Dataset.show(Dataset.scala:864)
at org.apache.spark.sql.Dataset.show(Dataset.scala:823)
at org.apache.spark.sql.Dataset.show(Dataset.scala:832)
... 47 elided
Adding further to mck's answer: you are using the old Java date-time API's short zone IDs. According to the Databricks blog post A Comprehensive Look at Dates and Timestamps in Apache Spark™ 3.0, Spark migrated to the new API in version 3.0:
Since Java 8, the JDK has exposed a new API for date-time manipulation
and time zone offset resolution, and Spark migrated to this new API in
version 3.0. Although the mapping of time zone names to offsets has
the same source, IANA TZDB, it is implemented differently in Java 8
and higher versus Java 7.
You can verify this by opening spark-shell and listing the available short zone IDs like this:
import java.time.ZoneId
import scala.collection.JavaConverters._
ZoneId.SHORT_IDS.asScala.keys
//res0: Iterable[String] = Set(CTT, ART, CNT, PRT, PNT, PLT, AST, BST, CST, EST, HST, JST, IST, AGT, NST, MST, AET, BET, PST, ACT, SST, VST, CAT, ECT, EAT, IET, MIT, NET)
That said, you should not use abbreviations when specifying timezones; use the area/city format instead. See Which three-letter time zone IDs are not deprecated?
It seems Spark 3 can't understand AKST, but it does understand America/Anchorage, which I believe observes AKST:
spark.sql("select from_utc_timestamp('2020-10-01 11:12:30', 'America/Anchorage')").show

Is it possible to interpolate across a time series with influence from other columns?

I have a dataframe that contains missing data. I'm interested in exploring interpolation as a possible alternative to removing columns with missing data.
Below is a subset of the dataset. 'a_out' is the outdoor temperature, while 'b_in' etc. are temperatures from rooms in the same house.
a_out b_in c_in d_in e_in f_in
... ... ... ... ... ... ...
03/01/2016 6.51 17.71 15.15 14.04 15.27 16.32
04/01/2016 5.94 17.49 14.34 14.71
05/01/2016 6.74 17.57 14.80 15.18
06/01/2016 5.86 17.49 14.68 18.43 15.57
07/01/2016 5.18 17.18 14.02 14.88
08/01/2016 2.84 16.80 13.15 14.51 14.48
... ... ... ... ... ... ...
Might there be a way to interpolate the missing data, but with some weighting based on intact data in other columns? Perhaps 'cubic' interpolation could do the trick?
Thanks!
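A minimal sketch of both options, assuming the readings above are in a numeric pandas DataFrame named df with the column names shown (a hypothetical setup, not taken from the question):
import pandas as pd

# 1) Per-column cubic interpolation along the time axis; this ignores the
#    other columns entirely (requires scipy).
df_cubic = df.interpolate(method="cubic")

# 2) Multivariate imputation: each column with gaps is modelled from the
#    intact values in the other columns, which is closer to "weighting
#    based on intact data in other columns".
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), index=df.index, columns=df.columns)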

Is it possible to exceed the bandwidth limit when using multiple cores?

I'm testing bandwidth using the fio benchmark tool.
Here is my hardware spec:
2 sockets, 10 cores per socket
Kernel version: 4.8.17
Intel SSD 750 series
CPU: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz; SSD: Intel Solid State Drive 750 series, 400GB, 20nm Intel NAND Flash Memory MLC, NVMe PCIe 3.0 x4 add-in card
In the fio job file I invalidate the buffer/page cache for the file to be used prior to starting I/O. I also use the O_DIRECT flag (non-buffered I/O) to bypass the page cache, along with Linux native asynchronous I/O.
When I test with one core, the fio output says the bandwidth core 0 achieved is 1516.7MB/s.
That doesn't exceed the bandwidth limit of the Intel SSD 750, so that is fine.
Here is the test1 job file:
[global]
filename=/dev/nvme0n1
runtime=10
bs=4k
ioengine=libaio
direct=1
iodepth=64
invalidate=1
randrepeat=0
log_avg_msec=1000
time_based
thread=1
size=256m
[job1]
cpus_allowed=0
rw=randread
But when I do this with 3 cores, the total bandwidth across the cores exceeds
the Intel SSD 750's bandwidth limit.
The total bandwidth of the 3 cores is about 3000MB/s,
while according to the Intel SSD 750 spec, the bandwidth limit is 2200MB/s.
Here is the job file for test2 (3 cores):
[global]
filename=/dev/nvme0n1
runtime=10
bs=4k
ioengine=libaio
direct=1
iodepth=64
invalidate=1
randrepeat=0
log_avg_msec=1000
time_based
thread=1
size=256m
[job1]
cpus_allowed=0
rw=randread
[job2]
cpus_allowed=1
rw=randread
[job3]
cpus_allowed=2
rw=randread
I don't know how this is happening.
Here is the fio output of test1:
job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.2.10
Starting 1 thread
job1: (groupid=0, jobs=1): err= 0: pid=6924: Mon Jan 29 20:14:33 2018
read : io=15139MB, bw=1513.8MB/s, iops=387516, runt= 10001msec
slat (usec): min=0, max=42, avg= 1.97, stdev= 1.12
clat (usec): min=5, max=1072, avg=162.70, stdev=20.17
lat (usec): min=6, max=1073, avg=164.74, stdev=20.39
clat percentiles (usec):
| 1.00th=[ 141], 5.00th=[ 145], 10.00th=[ 149], 20.00th=[ 151],
| 30.00th=[ 155], 40.00th=[ 157], 50.00th=[ 159], 60.00th=[ 161],
| 70.00th=[ 165], 80.00th=[ 169], 90.00th=[ 179], 95.00th=[ 211],
| 99.00th=[ 229], 99.50th=[ 262], 99.90th=[ 318], 99.95th=[ 318],
| 99.99th=[ 334]
lat (usec) : 10=0.01%, 20=0.01%, 50=0.02%, 100=0.03%, 250=99.35%
lat (usec) : 500=0.60%, 1000=0.01%
lat (msec) : 2=0.01%
cpu : usr=22.32%, sys=77.64%, ctx=102, majf=0, minf=421
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=3875556/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: io=15139MB, aggrb=1513.8MB/s, minb=1513.8MB/s, maxb=1513.8MB/s, mint=10001msec, maxt=10001msec
Disk stats (read/write):
nvme0n1: ios=3834624/0, merge=0/0, ticks=25164/0, in_queue=25184, util=99.61%
Here is the fio output of test2 (3 cores):
job1: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
job2: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
job3: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.2.10
Starting 3 threads
job1: (groupid=0, jobs=1): err= 0: pid=6968: Mon Jan 29 20:14:53 2018
read : io=10212MB, bw=1021.2MB/s, iops=261413, runt= 10001msec
slat (usec): min=1, max=140, avg= 2.49, stdev= 1.23
clat (usec): min=4, max=970, avg=241.78, stdev=138.10
lat (usec): min=7, max=972, avg=244.35, stdev=138.09
clat percentiles (usec):
| 1.00th=[ 17], 5.00th=[ 25], 10.00th=[ 33], 20.00th=[ 64],
| 30.00th=[ 135], 40.00th=[ 225], 50.00th=[ 306], 60.00th=[ 330],
| 70.00th=[ 346], 80.00th=[ 366], 90.00th=[ 390], 95.00th=[ 410],
| 99.00th=[ 438], 99.50th=[ 446], 99.90th=[ 474], 99.95th=[ 502],
| 99.99th=[ 668]
lat (usec) : 10=0.01%, 20=2.03%, 50=14.39%, 100=9.67%, 250=16.14%
lat (usec) : 500=57.71%, 750=0.05%, 1000=0.01%
cpu : usr=17.32%, sys=71.84%, ctx=182182, majf=0, minf=318
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=2614396/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=64
job2: (groupid=0, jobs=1): err= 0: pid=6969: Mon Jan 29 20:14:53 2018
read : io=10540MB, bw=1053.1MB/s, iops=269802, runt= 10001msec
slat (usec): min=1, max=35, avg= 1.93, stdev= 0.97
clat (usec): min=5, max=903, avg=234.55, stdev=139.14
lat (usec): min=7, max=904, avg=236.56, stdev=139.13
clat percentiles (usec):
| 1.00th=[ 16], 5.00th=[ 22], 10.00th=[ 30], 20.00th=[ 57],
| 30.00th=[ 112], 40.00th=[ 207], 50.00th=[ 298], 60.00th=[ 330],
| 70.00th=[ 346], 80.00th=[ 362], 90.00th=[ 386], 95.00th=[ 402],
| 99.00th=[ 426], 99.50th=[ 438], 99.90th=[ 462], 99.95th=[ 494],
| 99.99th=[ 628]
lat (usec) : 10=0.01%, 20=3.22%, 50=14.51%, 100=10.76%, 250=15.48%
lat (usec) : 500=55.97%, 750=0.05%, 1000=0.01%
cpu : usr=26.08%, sys=59.08%, ctx=377522, majf=0, minf=326
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=2698293/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=64
job3: (groupid=0, jobs=1): err= 0: pid=6970: Mon Jan 29 20:14:53 2018
read : io=10368MB, bw=1036.8MB/s, iops=265406, runt= 10001msec
slat (usec): min=1, max=102, avg= 2.48, stdev= 1.24
clat (usec): min=5, max=874, avg=238.10, stdev=139.10
lat (usec): min=7, max=877, avg=240.66, stdev=139.09
clat percentiles (usec):
| 1.00th=[ 18], 5.00th=[ 27], 10.00th=[ 39], 20.00th=[ 72],
| 30.00th=[ 113], 40.00th=[ 193], 50.00th=[ 290], 60.00th=[ 330],
| 70.00th=[ 350], 80.00th=[ 370], 90.00th=[ 398], 95.00th=[ 414],
| 99.00th=[ 442], 99.50th=[ 454], 99.90th=[ 474], 99.95th=[ 498],
| 99.99th=[ 628]
lat (usec) : 10=0.01%, 20=1.51%, 50=12.00%, 100=13.78%, 250=17.81%
lat (usec) : 500=54.84%, 750=0.05%, 1000=0.01%
cpu : usr=17.96%, sys=71.88%, ctx=170809, majf=0, minf=319
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=2654335/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: io=31121MB, aggrb=3111.9MB/s, minb=1021.2MB/s, maxb=1053.1MB/s, mint=10001msec, maxt=10001msec
Disk stats (read/write):
nvme0n1: ios=7883218/0, merge=0/0, ticks=1730536/0, in_queue=1763060, util=99.52%
Hmm...
@peter-cordes makes a good point about the (device) cache. A Google search turns up https://www.techspot.com/review/984-intel-ssd-750-series/, which says the following:
Also onboard are five Micron D9PQL DRAM chips which are used as a 1.25GB cache and the specs say this is DDR3-1600 memory.
Given that you're restricting fio to working in the same 256MByte region for all threads, it could well be that all your I/O easily fits into the device's cache. There's no dedicated way of discarding a device's cache (as opposed to Linux's buffer cache) other than natural means, though, so I'd recommend making your working region dramatically bigger (e.g. tens to hundreds of gigabytes) to reduce the odds of a thread's data being prefetched by another thread's accesses.
Additionally, I would ask "what data did you put down onto the SSD before you read it back?" SSDs are typically "thin" in the sense that they can be aware of regions that have never been written, or where they have been told that a region has been explicitly discarded. Because of this, reading from such regions means the SSD has little work to do and can return data extremely quickly (like what an OS does when you read from a hole in a sparse file). In "real life" it is rare that you choose to read something you've never written, so doing so will distort your results.
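To ground that last point, one way to make sure the region you read has actually been written is to prefill it first (fio itself can also do this with a sequential write job). A rough, destructive sketch, assuming the same device as in the job files above; the size and block values are arbitrary examples:
# WARNING: this overwrites data on the target device.
import os

DEVICE = "/dev/nvme0n1"      # same device as the fio jobs above
PREFILL_BYTES = 50 * 2**30   # e.g. 50 GiB, far larger than the device's DRAM cache
BLOCK = 1 * 2**20            # 1 MiB writes

def prefill(path: str, total: int, block: int) -> None:
    fd = os.open(path, os.O_WRONLY)
    try:
        written = 0
        while written < total:
            written += os.write(fd, os.urandom(block))
        os.fsync(fd)
    finally:
        os.close(fd)

prefill(DEVICE, PREFILL_BYTES, BLOCK)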

What does LOAD in nodetool status measure?

I am observing a higher load on one Cassandra node (compared to other nodes in the ring) and I am looking for help interpreting this data. I have anonymized my IPs, but the snippet below shows a comparison of "good" node .199 (load 14G) and "bad" node .159 (load 25G):
nodetool status|grep -E '199|159'
UN XXXXX.159 25.2 GB 256 ? ffda4798-tokentoken XXXXXX
UN XXXXX.199 13.37 GB 256 ? c3a49dca-tokentoken YYYY
Note that the load is almost 2x on .159, yet neither memory nor disk usage explains this:
.199 (low load box) data -- memory at about 34%, disk 50-60G:
top|grep apache_cassan
28950 root 20 0 24.353g 0.010t 1.440g S 225.3 34.2 25826:35 apache_cassandr
28950 root 20 0 24.357g 0.010t 1.448g S 212.4 34.2 25826:41 apache_cassandr
28950 root 20 0 24.357g 0.010t 1.452g S 219.7 34.3 25826:48 apache_cassandr
28950 root 20 0 24.357g 0.011t 1.460g S 250.5 34.3 25826:55 apache_cassandr
Filesystem Size Used Avail Use% Mounted on
/dev/sde1 559G 47G 513G 9% /cassandra/data_dir_a
/dev/sdf1 559G 63G 497G 12% /cassandra/data_dir_b
/dev/sdg1 559G 54G 506G 10% /cassandra/data_dir_c
/dev/sdh1 559G 57G 503G 11% /cassandra/data_dir_d
.159 (high load box) data -- memory at about 28%, disk 20-40G:
top|grep apache_cassan
25354 root 20 0 36.297g 0.017t 8.608g S 414.7 27.8 170:42.81 apache_cassandr
25354 root 20 0 36.302g 0.017t 8.608g S 272.2 27.8 170:51.00 apache_cassandr
25354 root 20 0 36.302g 0.017t 8.612g S 129.7 27.8 170:54.90 apache_cassandr
25354 root 20 0 36.354g 0.017t 8.625g S 94.1 27.8 170:57.73 apache_cassandr
Filesystem Size Used Avail Use% Mounted on
/dev/sde1 838G 17G 822G 2% /cassandra/data_dir_a
/dev/sdf1 838G 11G 828G 2% /cassandra/data_dir_b
/dev/sdg1 838G 35G 804G 5% /cassandra/data_dir_c
/dev/sdh1 838G 26G 813G 4% /cassandra/data_dir_d
TL;DR version -- what does the nodetool status 'Load' column actually measure/report?
The nodetool status command provides the following information:
Status - U (up) or D (down)
Indicates whether the node is functioning or not.
Load - updates every 90 seconds
The amount of file system data under the cassandra data directory, excluding all content in the snapshots subdirectories. Because all SSTable data files are included, any data that is not cleaned up (such as TTL-expired cells or tombstoned data) is counted.
For more information, see the nodetool status output description.
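If you want to cross-check the Load figure on a node yourself, a rough sketch is to sum file sizes under the data directories while skipping the snapshots subdirectories (the directory paths below are taken from the df output above; this only approximates what Cassandra reports):
import os

DATA_DIRS = [
    "/cassandra/data_dir_a",
    "/cassandra/data_dir_b",
    "/cassandra/data_dir_c",
    "/cassandra/data_dir_d",
]

def live_bytes(root: str) -> int:
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != "snapshots"]  # skip snapshots
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file may disappear during compaction
    return total

print(round(sum(live_bytes(d) for d in DATA_DIRS) / 2**30, 2), "GiB")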
