High value of shm_flushes in Varnish 3.0.5 - varnish

We have a very high value of shm_flushes in varnishstat:
Hitrate ratio: 0 0 0
Hitrate avg: 0.0000 0.0000 0.0000
3900996 90.85 inf backend_busy - Backend conn. too many
12498808 228.62 inf backend_reuse - Backend conn. reuses
9950893 186.69 inf backend_toolate - Backend conn. was closed
35649 0.00 inf backend_recycle - Backend conn. recycles
733236 10.98 inf backend_retry - Backend conn. retry
94911 0.00 inf fetch_head - Fetch head
468 0.00 inf fetch_eof - Fetch EOF
1962214 25.96 inf fetch_bad - Fetch had bad headers
16922 0.00 inf fetch_close - Fetch wanted close
1979182 34.94 inf fetch_oldhttp - Fetch pre HTTP/1.1 closed
26 0.00 inf fetch_zero - Fetch zero len
927 0.00 inf fetch_failed - Fetch failed
1260014 29.95 inf fetch_1xx - Fetch no body (1xx)
332729 5.99 inf fetch_204 - Fetch no body (204)
7341 . . n_sess - N struct sess
112 . . n_objectcore - N struct objectcore
456054 . . n_vbc - N struct vbc
762 . . n_wrk - N worker threads
427 -40.93 inf n_wrk_create - N worker threads created
368860 -2.00 inf n_wrk_failed - N worker threads not created
368974 2.00 inf n_wrk_lqueue - work request queue length
361762 11.98 inf n_wrk_queued - N queued work requests
115 0.00 inf n_wrk_drop - N dropped work requests
42 . . n_backend - N backends
115 . . n_expired - N expired objects
2747 . . n_lru_nuked - N LRU nuked objects
18073 0.00 inf n_objwrite - Objects sent with write
14 0.00 inf s_sess - Total Sessions
364370 12.98 inf s_req - Total Requests
7843107 119.80 inf s_pass - Total pass
9802175 187.68 inf s_bodybytes - Total body bytes
3900985 108.82 inf sess_pipeline - Session Pipeline
12498808 228.62 inf sess_readahead - Session Read Ahead
6 0.00 inf sess_linger - Session Linger
1323887 26.95 inf sess_herd - Session herd
2056953 37.94 inf shm_records - SHM records
4873964594 91989.09 inf shm_writes - SHM writes
304821503001 6421550.43 inf shm_flushes - SHM flushes due to overflow
1057284 31.95 inf shm_cont - SHM MTX contention
102 0.00 inf shm_cycles - SHM cycles through buffer
94 0.00 inf sms_nreq - SMS allocator requests
11853499 . . sms_nobj - SMS outstanding allocations
10496270 . . sms_nbytes - SMS outstanding bytes
655590062 . . sms_balloc - SMS bytes allocated
38937693 . . sms_bfree - SMS bytes freed
211416 2.00 inf n_vcl - N vcl total
302 0.00 inf n_vcl_avail - N vcl available
24886 0.00 inf n_vcl_discard - N vcl discarded
10704338 0.00 inf n_ban_add - N new bans added
10704338 0.00 inf n_ban_retire - N old bans deleted
2057082 25.96 inf n_ban_obj_test - N objects tested
3 0.00 inf n_ban_re_test - N regexps tested against
3 0.00 inf n_ban_dups - N duplicate bans removed
2 0.00 inf hcb_lock - HCB Lookups with lock
1 0.00 inf hcb_insert - HCB Inserts
2 0.00 inf esi_errors - ESI parse errors (unlock)
535 0.00 inf accept_fail - Accept failures
535 0.00 inf client_drop_late - Connection dropped late
10718794 175.70 inf dir_dns_lookups - DNS director lookups
610318 4.99 inf dir_dns_failed - DNS director failed lookups
610316 4.99 inf dir_dns_hit - DNS director cached lookups hit
Also, the hitrate is shown as 0, but we definitely have cache hits in Varnish, as we can see in the responses and in varnishlog.
We tried -p shm_workspace=16384 in DAEMON_OPTS, but shm_flushes keeps increasing.
Shouldn't the value of shm_flushes be nearly 0?

This listing/output doesn't make sense.
shm_flushes should be of the same order of magnitude as shm_records, and 6.4 million per second clearly isn't.
I suspect you're using the wrong version of libvarnishapi.

I would be more concerned about inf backend_busy - Backend conn. too many in the first place, and also that your Hitrate ratio is 0 0 0.
I had a similar problem with stats not reporting correctly: for some reason I had libvarnishapi-dev installed instead of libvarnishapi1, and probably the wrong version as well.
After I ran sudo apt-get install libvarnishapi1 and rebooted, I no longer get backend_busy - Backend conn. too many, and my clocks are also ticking now in varnishstat.
When you get there, see if your shm_flushes are still too high.

Related

Perf stat HW counters

perf stat ./myapp
and the result looks like this (it's just an example):
Performance counter stats for 'myapp':
83723.452481 task-clock:u (msec) # 1.004 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
3,228,188 page-faults:u # 0.039 M/sec
229,570,665,834 cycles:u # 2.742 GHz
313,163,853,778 instructions:u # 1.36 insn per cycle
69,704,684,856 branches:u # 832.559 M/sec
2,078,861,393 branch-misses:u # 2.98% of all branches
83.409183620 seconds time elapsed
74.684747000 seconds user
8.739217000 seconds sys
Perf stat prints user time and system time, but a HW counter is incremented by whatever the CPU executes.
For HW counters like cycles or instructions, does perf count them only for "myapp"?
For instance (cs = context switch):
|--------|------|------|--------|------|------|
  myapp    cs     cs     myapp    cs     cs    end
inst: 0     10     20      50      80     100
60 instructions belong to "myapp", but the HW counter value is 100. Does perf stat then print out 60?
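perf_events handles this by scheduling the counter in and out with the task: when "myapp" is switched out, the counter value is read and the delta banked against the task, and counting resumes when it is switched back in. A minimal sketch of that bookkeeping in Python; the segment lengths are hypothetical, chosen only to match the premise in the question (60 instructions for myapp out of 100 total):

```python
# Hedged sketch of per-task counter virtualization: only intervals in which
# the task of interest is scheduled contribute to its counter.

def count_for_task(timeline, task):
    """Sum counter deltas only for segments where `task` was running."""
    total = 0
    for owner, retired in timeline:
        if owner == task:      # counter is "live" only while this task runs
            total += retired   # delta banked at the next context switch
    return total

# (owner, instructions retired in that segment) -- hypothetical numbers
timeline = [("myapp", 10), ("cs", 10), ("cs", 10),
            ("myapp", 50), ("cs", 10), ("cs", 10)]

print(sum(n for _, n in timeline))        # CPU-wide total: 100
print(count_for_task(timeline, "myapp"))  # per-task count: 60
```

So for a per-task event, yes: perf stat would report the 60 that belong to myapp, not the CPU-wide 100.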

Perf output is less than the number of actual instructions

I tried to count the number of instructions of an add-loop application on a RISC-V FPGA, using a very simple RV32IM core running Linux 5.4.0 buildroot.
add.c:
#include <stdio.h>

int main()
{
    int a = 0;
    for (int i = 0; i < 1024*1024; i++)
        a++;
    printf("RESULT: %d\n", a);
    return a;
}
I used the -O0 compile option so that the loop really loops, and the resulting dump file is the following:
000103c8 <main>:
103c8: fe010113 addi sp,sp,-32
103cc: 00812e23 sw s0,28(sp)
103d0: 02010413 addi s0,sp,32
103d4: fe042623 sw zero,-20(s0)
103d8: fe042423 sw zero,-24(s0)
103dc: 01c0006f j 103f8 <main+0x30>
103e0: fec42783 lw a5,-20(s0)
103e4: 00178793 addi a5,a5,1 # 12001 <__TMC_END__+0x1>
103e8: fef42623 sw a5,-20(s0)
103ec: fe842783 lw a5,-24(s0)
103f0: 00178793 addi a5,a5,1
103f4: fef42423 sw a5,-24(s0)
103f8: fe842703 lw a4,-24(s0)
103fc: 001007b7 lui a5,0x100
10400: fef740e3 blt a4,a5,103e0 <main+0x18>
10404: fec42783 lw a5,-20(s0)
10408: 00078513 mv a0,a5
1040c: 01c12403 lw s0,28(sp)
10410: 02010113 addi sp,sp,32
10414: 00008067 ret
As you can see, the application loops from 103e0 to 10400, which is 9 instructions, so the total instruction count must be at least 9 * 1024^2.
But the result of perf stat is pretty weird:
RESULT: 1048576
Performance counter stats for './add.out':
3170.45 msec task-clock # 0.841 CPUs utilized
20 context-switches # 0.006 K/sec
0 cpu-migrations # 0.000 K/sec
38 page-faults # 0.012 K/sec
156192046 cycles # 0.049 GHz (11.17%)
8482441 instructions # 0.05 insn per cycle (11.12%)
1145775 branches # 0.361 M/sec (11.25%)
3.771031341 seconds time elapsed
0.075933000 seconds user
3.559385000 seconds sys
The total number of instructions perf counted was lower than 9 * 1024^2; the difference is about 10%.
How is this happening? I would expect the output of perf to be larger than that, because perf measures not only add.out itself but also the overhead of perf and of context switching.
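The gap is easy to quantify from the numbers above (the expected count assumes exactly 9 instructions per iteration and ignores the prologue, epilogue, and libc startup):

```python
# Compare the expected minimum instruction count against what perf reported.
expected = 9 * 1024**2   # 9 instructions per loop iteration
measured = 8_482_441     # "instructions" line from the perf stat output

shortfall = expected - measured
print(expected)                               # -> 9437184
print(round(100 * shortfall / expected, 1))   # -> 10.1
```

One thing worth noting: the percentages in parentheses in the perf output (e.g. (11.12%)) mean the event was multiplexed, i.e. only counted for that fraction of the run and then scaled up to an estimate, so an error of this magnitude on a small in-order core is plausible.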

Nodejs delay/interrupt in for loop

I want to write a logger (please, no comments about why or "use ...").
But I am confused by the Node.js (event?) loop/forEach.
As an example:
for (var i = 0; i < 100; i++) {
    process.stdout.write(Date.now().toString() + "\n", "utf8");
}
Output: 1466021578453, 1466021578453, 1466021578469, 1466021578469
Questions: where does the 16 ms delay come from, and how can I prevent it?
EDIT: Windows 7, x64 (on Ubuntu 15 the delay is at most 2 ms)
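The jump in the sample output is a single 16 ms step, which is likely just clock granularity: the default Windows timer tick is roughly 15.6 ms, so Date.now() advances in coarse steps there. A quick check of the deltas from the question's own output (sketched in Python):

```python
# Deltas between the consecutive timestamps printed by the loop.
stamps = [1466021578453, 1466021578453, 1466021578469, 1466021578469]
deltas = [b - a for a, b in zip(stamps, stamps[1:])]
print(deltas)  # -> [0, 16, 0]
```

Several iterations share one timestamp, then the clock ticks 16 ms at once; the loop itself is not pausing.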
sudo ltrace -o outlog node myTest.js
This is likely more than you want. On my machine, the call Date.now() translates into clock_gettime. You want to look at the stuff between subsequent calls to clock_gettime. You're also writing to stdout, and each time you do that there is overhead. You can run the whole process under ltrace to see what's happening, and get a summary with -c.
For me, it runs in 3 ms when not running under ltrace.
% time seconds usecs/call calls function
------ ----------- ----------- --------- --------------------
28.45 6.629315 209 31690 memcpy
26.69 6.219529 217 28544 memcmp
16.78 3.910686 217 17990 free
9.73 2.266705 214 10590 malloc
2.92 0.679971 220 3083 _Znam
2.86 0.666421 216 3082 _ZdaPv
2.55 0.593798 206 2880 _ZdlPv
2.16 0.502644 211 2378 _Znwm
1.09 0.255114 213 1196 strlen
0.69 0.161741 215 750 pthread_getspecific
0.67 0.155609 209 744 memmove
0.57 0.133857 212 631 _ZNSo6sentryC1ERSo
0.57 0.133344 226 589 pthread_mutex_lock
0.52 0.121342 206 589 pthread_mutex_unlock
0.46 0.106343 207 512 clock_gettime
0.40 0.093022 204 454 memset
0.39 0.089857 216 416 _ZNSt9basic_iosIcSt11char_traitsIcEE4initEPSt15basic_streambufIcS1_E
0.22 0.050741 195 259 strcmp
0.20 0.047454 228 208 _ZNSt8ios_baseC2Ev
0.20 0.047236 227 208 floor
0.19 0.044603 214 208 _ZNSt6localeC1Ev
0.19 0.044536 212 210 _ZNSs4_Rep10_M_destroyERKSaIcE
0.19 0.044200 212 208 _ZNSt8ios_baseD2Ev
I'm not sure why there are 31,690 memcpy calls and 28,544 memcmp calls in there. That seems a bit excessive, but perhaps that's just the JIT start-up cost. As for the runtime cost, you can see there are 512 calls to clock_gettime. No idea why there are that many calls either, but you can see 106 ms lost in clock_gettime. Good luck with it.

Decipher garbage collection output

I was running a sample program using
rahul#g3ck0:~/programs/Remodel$ GOGCTRACE=1 go run main.go
gc1(1): 0+0+0 ms 0 -> 0 MB 422 -> 346 (422-76) objects 0 handoff
gc2(1): 0+0+0 ms 0 -> 0 MB 2791 -> 1664 (2867-1203) objects 0 handoff
gc3(1): 0+0+0 ms 1 -> 0 MB 4576 -> 2632 (5779-3147) objects 0 handoff
gc4(1): 0+0+0 ms 1 -> 0 MB 3380 -> 2771 (6527-3756) objects 0 handoff
gc5(1): 0+0+0 ms 1 -> 0 MB 3511 -> 2915 (7267-4352) objects 0 handoff
gc6(1): 0+0+0 ms 1 -> 0 MB 6573 -> 2792 (10925-8133) objects 0 handoff
gc7(1): 0+0+0 ms 1 -> 0 MB 4859 -> 3059 (12992-9933) objects 0 handoff
gc8(1): 0+0+0 ms 1 -> 0 MB 4554 -> 3358 (14487-11129) objects 0 handoff
gc9(1): 0+0+0 ms 1 -> 0 MB 8633 -> 4116 (19762-15646) objects 0 handoff
gc10(1): 0+0+0 ms 1 -> 0 MB 9415 -> 4769 (25061-20292) objects 0 handoff
gc11(1): 0+0+0 ms 1 -> 0 MB 6636 -> 4685 (26928-22243) objects 0 handoff
gc12(1): 0+0+0 ms 1 -> 0 MB 6741 -> 4802 (28984-24182) objects 0 handoff
gc13(1): 0+0+0 ms 1 -> 0 MB 9654 -> 5097 (33836-28739) objects 0 handoff
gc1(1): 0+0+0 ms 0 -> 0 MB 209 -> 171 (209-38) objects 0 handoff
Help me understand the first part, i.e.:
0 + 0 + 0 => mark + sweep + clean times
Does 422 -> 346 mean that memory has been cleaned up from 422 MB to 346 MB?
If yes, then how come the memory was reduced when there was nothing to be cleaned up?
In Go 1.5, the format of this output has changed considerably. For the full documentation, head over to http://godoc.org/runtime and search for "gctrace:"
gctrace: setting gctrace=1 causes the garbage collector to emit a single line to standard
error at each collection, summarizing the amount of memory collected and the
length of the pause. Setting gctrace=2 emits the same summary but also
repeats each collection. The format of this line is subject to change.
Currently, it is:
gc # ##s #%: #+...+# ms clock, #+...+# ms cpu, #->#-># MB, # MB goal, # P
where the fields are as follows:
gc # the GC number, incremented at each GC
##s time in seconds since program start
#% percentage of time spent in GC since program start
#+...+# wall-clock/CPU times for the phases of the GC
#->#-># MB heap size at GC start, at GC end, and live heap
# MB goal goal heap size
# P number of processors used
The phases are stop-the-world (STW) sweep termination, scan,
synchronize Ps, mark, and STW mark termination. The CPU times
for mark are broken down in to assist time (GC performed in
line with allocation), background GC time, and idle GC time.
If the line ends with "(forced)", this GC was forced by a
runtime.GC() call and all phases are STW.
The output is generated from this line: http://golang.org/src/pkg/runtime/mgc0.c?#L2147
So the different parts are:
0+0+0 ms : mark, sweep and clean durations in ms
1 -> 0 MB : heap size before and after, in MB
209 -> 171 : objects before and after
(209-38) objects : number of allocs and frees
handoff (and, in Go 1.2, steal and yields) are internals of the algorithm.
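Putting those field descriptions together, here is a throwaway parser for the pre-1.5 line format, sketched in Python (the regex and field names are mine, not anything from the Go runtime), run against the last gc line from the question:

```python
import re

# One pre-Go-1.5 gctrace line from the question's output.
LINE = "gc1(1): 0+0+0 ms 0 -> 0 MB 209 -> 171 (209-38) objects 0 handoff"

m = re.match(
    r"gc(?P<n>\d+)\((?P<procs>\d+)\): "
    r"(?P<mark>\d+)\+(?P<sweep>\d+)\+(?P<clean>\d+) ms "
    r"(?P<heap_before>\d+) -> (?P<heap_after>\d+) MB "
    r"(?P<obj_before>\d+) -> (?P<obj_after>\d+) "
    r"\((?P<allocs>\d+)-(?P<frees>\d+)\) objects",
    LINE,
)
fields = {k: int(v) for k, v in m.groupdict().items()}

print(fields["obj_before"], fields["obj_after"])  # -> 209 171
print(fields["allocs"] - fields["frees"])         # -> 171
```

Note that allocs minus frees (209 - 38 = 171) equals the object count after the collection, which is why 422 -> 346 shrinks even with a tiny heap: those are object counts, not MB.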

Fast conversion from string time to milliseconds

For a vector or list of times, I'd like to go from a string time, e.g. 12:34:56.789 to milliseconds from midnight, which would be equal to 45296789.
This is what I do now:
toms <- function(time) {
  sapply(strsplit(time, ':', fixed = TRUE),
         function(x) sum(as.numeric(x) * c(3600000, 60000, 1000)))
}
and would like to do it faster.
Here's an example data set for benchmarking:
times = rep('12:34:56.789', 1e6)
system.time(toms(times))
# user system elapsed
# 9.00 0.04 9.05
You could use the fasttime package, which seems to be about an order of magnitude faster.
library(fasttime)
fasttoms <- function(time) {
1000*unclass(fastPOSIXct(paste("1970-01-01",time)))
}
times <- rep('12:34:56.789', 1e6)
system.time(toms(times))
# user system elapsed
# 6.61 0.03 6.68
system.time(fasttoms(times))
# user system elapsed
# 0.53 0.00 0.53
identical(fasttoms(times),toms(times))
# [1] TRUE
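For reference, the same arithmetic (hours, minutes and seconds weighted into milliseconds) sketched in Python, with the expected value from the question:

```python
def to_ms(time_str):
    """Convert 'HH:MM:SS.mmm' to milliseconds since midnight."""
    h, m, s = time_str.split(":")
    return int(h) * 3_600_000 + int(m) * 60_000 + round(float(s) * 1000)

print(to_ms("12:34:56.789"))  # -> 45296789
```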
