Chronicle Queue: Why does a busy-waiting tailer generate garbage when spinning (nothing to be read)?

I have a busy-waiting loop in which a tailer is constantly trying to read from a queue:
final Bytes<ByteBuffer> bbb = Bytes.elasticByteBuffer(MAX_SIZE, MAX_SIZE);

// Busy wait loop.
while (true) {
    tailer.readDocument(wire -> {
        wire.read().marshallable(m -> {
            m.read(DATA).bytes(bbb, true);
            long rcvAt = m.read(RCVAT).int64();
            System.out.println(rcvAt);
        });
    });
}
Why does this code generate garbage even when there is nothing to read from the queue?
Relevant JVM flags:
-server -XX:InitialHeapSize=64m -XX:MaxHeapSize=64m
GC logs and memory profile (VisualVM):
The GC log is flooded with entries like this:
...
[30.071s][info][gc ] GC(1755) Pause Young (Normal) (G1 Evacuation Pause) 23M->6M(30M) 0.250ms
[30.084s][info][gc ] GC(1756) Pause Young (Normal) (G1 Evacuation Pause) 23M->7M(30M) 0.386ms
[30.096s][info][gc ] GC(1757) Pause Young (Normal) (G1 Evacuation Pause) 24M->7M(30M) 0.544ms
[30.109s][info][gc ] GC(1758) Pause Young (Normal) (G1 Evacuation Pause) 24M->7M(30M) 0.759ms
[30.122s][info][gc ] GC(1759) Pause Young (Normal) (G1 Evacuation Pause) 24M->7M(30M) 0.808ms
[30.135s][info][gc ] GC(1760) Pause Young (Normal) (G1 Evacuation Pause) 24M->7M(30M) 0.937ms
...

You have capturing lambdas: they hold a reference to bbb, so a new instance is created on every iteration. You can store them in local variables outside the loop to avoid creating them each time.
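For example, something along these lines (a sketch only: MAX_SIZE, DATA, RCVAT and tailer come from your snippet, and the functional interface accepted by readDocument/marshallable is assumed to be ReadMarshallable - adjust to your Chronicle Wire version):

import net.openhft.chronicle.bytes.Bytes;
import net.openhft.chronicle.wire.ReadMarshallable;
import java.nio.ByteBuffer;

// Allocate the buffer and both lambdas once, before entering the hot loop.
final Bytes<ByteBuffer> bbb = Bytes.elasticByteBuffer(MAX_SIZE, MAX_SIZE);
final ReadMarshallable inner = m -> {
    m.read(DATA).bytes(bbb, true);
    long rcvAt = m.read(RCVAT).int64();
    System.out.println(rcvAt);
};
final ReadMarshallable outer = wire -> wire.read().marshallable(inner);

// Busy wait loop - the same lambda instances are reused on every spin,
// so an empty poll no longer allocates.
while (true) {
    tailer.readDocument(outer);
}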
I also suggest using Flight Recorder, as it generates far less garbage while monitoring the application.
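For example, on JDK 11 and later a recording can be started straight from the command line (older JDK 8 builds may additionally need -XX:+UnlockCommercialFeatures -XX:+FlightRecorder; treat the exact flags as version-dependent):
    -XX:StartFlightRecording=duration=60s,filename=recording.jfr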

Related

MailboxProcessor Scan memory leak

Consider the following agent which uses Scan to process every message, with a message posted every millisecond.
open System.Threading

let mp = MailboxProcessor<string>.Start (fun inbox ->
    let rec await () = inbox.Scan (fun msg ->
        printfn "Received : %s" msg
        printfn "Queue length: %i" inbox.CurrentQueueLength
        Some <| await ())
    await ())

while true do
    mp.Post "word"
    Thread.Sleep 1
If I monitor the memory usage of the process (macOS, via Activity Monitor), it grows and grows. Yet you can see from the printout that the queue length remains at 0.
Is this a memory leak bug in Scan or am I doing something wrong?

NodeJS, Promises and performance

My question is about performance in my NodeJS app...
If my program runs 12 iterations of 1,250,000 each = 15,000,000 iterations altogether, it takes the following time to process on dedicated Amazon servers:
r3.large: 2 vCPU, 6.5 ECU, 15 GB memory --> 123 minutes
4.8xlarge: 36 vCPU, 132 ECU, 60 GB memory --> 102 minutes
I have some code similar to the code below...
start();

function start() {
    for (var i = 0; i < 12; i++) {
        function2(); // Iterates over a collection - which contains data split up in intervals - by date intervals. This function is actually also recursive - due to the fact that it runs through the data many times (MAX 50-100 times) - due to different interval sizes...
    }
}

function function2() {
    return new Promise {
        for (var i = 0; i < 1250000; i++) {
            return new Promise {
                function3(); // This function simply iterates through all possible combinations - and calls function3 - with all given values/combinations
            }
        }
    }
}

function function3() {
    return new Promise { // This function simply makes some calculations based on the given values/combination - and then returns the result to function2 - which in the end decides which result/combination was the best...
    }
}
This is equal to 0.411 milliseconds (411 microseconds) per iteration!
When I look at performance and memory usage in Task Manager, the CPU is not running at 100% but more like 50% - the entire time?
The memory usage starts very low, but it keeps growing by gigabytes every minute until the process is done - yet the (allocated) memory is only released when I press CTRL+C in the Windows CMD... so it's as if the NodeJS garbage collection doesn't work optimally - or maybe it's simply the design of the code again...
When I execute the app I use the memory option like:
node --max-old-space-size="50000" server.js
PLEASE tell me everything you think I can do to make my program FASTER!
Thank you all - so much!
It's not that the garbage collector doesn't work optimally but that it doesn't work at all - you don't give it any chance to.
When developing the tco module that does tail call optimization in Node, I noticed a strange thing. It seemed to leak memory and I didn't know why. It turned out that it was because of a few console.log()
calls in various places that I used for testing, to see what was going on - since getting the result of a recursive call millions of levels deep took some time, I wanted to see something while it was running.
Your example is pretty similar to that.
Remember that Node is single-threaded. When your computations run, nothing else can - including the GC. Your code is completely synchronous and blocking - even though it's generating millions of promises in a blocking manner. It is blocking because it never reaches the event loop.
Consider this example:
var a = 0, b = 10000000;

function numbers() {
    while (a < b) {
        console.log("Number " + a++);
    }
}

numbers();
It's pretty simple - you want to print 10 million numbers. But when you run it, it behaves very strangely - for example, it prints numbers up to some point and then stops for several seconds, then it keeps going, or maybe starts thrashing if you're using swap, or maybe gives you this error, which I just got right after it printed Number 8486:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Aborted
What's going on here is that the main thread is blocked in a synchronous loop where it keeps creating objects but the GC has no chance to release them.
For such long running tasks you need to divide your work and get into the event loop once in a while.
Here is how you can fix this problem:
var a = 0, b = 10000000;

function numbers() {
    var i = 0;
    while (a < b && i++ < 100) {
        console.log("Number " + a++);
    }
    if (a < b) setImmediate(numbers);
}

numbers();
It does the same - it prints numbers from a to b but in bunches of 100 and then it schedules itself to continue at the end of the event loop.
Output of $(which time) -v node numbers1.js 2>&1 | egrep 'Maximum resident|FATAL'
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Maximum resident set size (kbytes): 1495968
It used 1.5GB of memory and crashed.
Output of $(which time) -v node numbers2.js 2>&1 | egrep 'Maximum resident|FATAL'
Maximum resident set size (kbytes): 56404
It used 56MB of memory and finished.
See also these answers:
How to write non-blocking async function in Express request handler
How node.js server serve next request, if current request have huge computation?
Maximum call stack size exceeded in nodejs
Node; Q Promise delay
How to avoid jimp blocking the code node.js

Why does Go use cgo on Windows for a simple File.Write?

Rewriting a simple program from C# to Go, I found the resulting executable 3 to 4 times slower. In particular, the Go version uses 3 to 4 times more CPU. This is surprising because the code does a lot of I/O and is not supposed to consume a significant amount of CPU.
I made a very simple version that only does sequential writes and benchmarked it. I ran the same benchmarks on Windows 10 and Linux (Debian Jessie). The times can't be compared directly (not the same systems, disks, ...) but the result is interesting.
I'm using the same Go version on both platforms: 1.6
On Windows, os.File.Write uses cgo (see runtime.cgocall below), but not on Linux. Why?
Here is the disk.go program:
package main

import (
    "crypto/rand"
    "fmt"
    "os"
    "time"
)

const (
    // size of the test file
    fullSize = 268435456
    // size of read/write per call
    partSize = 128
    // path of temporary test file
    filePath = "./bigfile.tmp"
)

func main() {
    buffer := make([]byte, partSize)
    seqWrite := func() error {
        return sequentialWrite(filePath, fullSize, buffer)
    }
    err := fillBuffer(buffer)
    panicIfError(err)
    duration, err := durationOf(seqWrite)
    panicIfError(err)
    fmt.Printf("Duration : %v\n", duration)
}

// It's just a test ;)
func panicIfError(err error) {
    if err != nil {
        panic(err)
    }
}

func durationOf(f func() error) (time.Duration, error) {
    startTime := time.Now()
    err := f()
    return time.Since(startTime), err
}

func fillBuffer(buffer []byte) error {
    _, err := rand.Read(buffer)
    return err
}

func sequentialWrite(filePath string, fullSize int, buffer []byte) error {
    desc, err := os.OpenFile(filePath, os.O_WRONLY|os.O_CREATE, 0666)
    if err != nil {
        return err
    }
    defer func() {
        desc.Close()
        err := os.Remove(filePath)
        panicIfError(err)
    }()
    var totalWrote int
    for totalWrote < fullSize {
        wrote, err := desc.Write(buffer)
        totalWrote += wrote
        if err != nil {
            return err
        }
    }
    return nil
}
The benchmark test (disk_test.go):
package main

import (
    "testing"
)

// go test -bench SequentialWrite -cpuprofile=cpu.out
// Windows : go tool pprof -text -nodecount=10 ./disk.test.exe cpu.out
// Linux : go tool pprof -text -nodecount=10 ./disk.test cpu.out
func BenchmarkSequentialWrite(t *testing.B) {
    buffer := make([]byte, partSize)
    err := sequentialWrite(filePath, fullSize, buffer)
    panicIfError(err)
}
The Windows result (with cgo):
11.68s of 11.95s total (97.74%)
Dropped 18 nodes (cum <= 0.06s)
Showing top 10 nodes out of 26 (cum >= 0.09s)
flat flat% sum% cum cum%
11.08s 92.72% 92.72% 11.20s 93.72% runtime.cgocall
0.11s 0.92% 93.64% 0.11s 0.92% runtime.deferreturn
0.09s 0.75% 94.39% 11.45s 95.82% os.(*File).write
0.08s 0.67% 95.06% 0.16s 1.34% runtime.deferproc.func1
0.07s 0.59% 95.65% 0.07s 0.59% runtime.newdefer
0.06s 0.5% 96.15% 0.28s 2.34% runtime.systemstack
0.06s 0.5% 96.65% 11.25s 94.14% syscall.Write
0.05s 0.42% 97.07% 0.07s 0.59% runtime.deferproc
0.04s 0.33% 97.41% 11.49s 96.15% os.(*File).Write
0.04s 0.33% 97.74% 0.09s 0.75% syscall.(*LazyProc).Find
The Linux result (without cgo):
5.04s of 5.10s total (98.82%)
Dropped 5 nodes (cum <= 0.03s)
Showing top 10 nodes out of 19 (cum >= 0.06s)
flat flat% sum% cum cum%
4.62s 90.59% 90.59% 4.87s 95.49% syscall.Syscall
0.09s 1.76% 92.35% 0.09s 1.76% runtime/internal/atomic.Cas
0.08s 1.57% 93.92% 0.19s 3.73% runtime.exitsyscall
0.06s 1.18% 95.10% 4.98s 97.65% os.(*File).write
0.04s 0.78% 95.88% 5.10s 100% _/home/sam/Provisoire/go-disk.sequentialWrite
0.04s 0.78% 96.67% 5.05s 99.02% os.(*File).Write
0.04s 0.78% 97.45% 0.04s 0.78% runtime.memclr
0.03s 0.59% 98.04% 0.08s 1.57% runtime.exitsyscallfast
0.02s 0.39% 98.43% 0.03s 0.59% os.epipecheck
0.02s 0.39% 98.82% 0.06s 1.18% runtime.casgstatus
Go does not perform file I/O itself; it delegates the task to the operating system. See the Go operating-system-dependent syscall packages.
Linux and Windows are different operating systems with different OS ABIs. For example, Linux uses syscalls via syscall.Syscall, while Windows calls into Windows DLLs. On Windows, the DLL call is a C call. It doesn't use cgo, but it does go through the same dynamic C call path used by cgo, runtime.cgocall; there is no separate runtime.wincall alias.
In summary, different operating systems have different OS call mechanisms.
Command cgo
Passing pointers
Go is a garbage collected language, and the garbage collector needs
to know the location of every pointer to Go memory. Because of this,
there are restrictions on passing pointers between Go and C.
In this section the term Go pointer means a pointer to memory
allocated by Go (such as by using the & operator or calling the
predefined new function) and the term C pointer means a pointer to
memory allocated by C (such as by a call to C.malloc). Whether a
pointer is a Go pointer or a C pointer is a dynamic property
determined by how the memory was allocated; it has nothing to do with
the type of the pointer.
Go code may pass a Go pointer to C provided the Go memory to which it
points does not contain any Go pointers. The C code must preserve this
property: it must not store any Go pointers in Go memory, even
temporarily. When passing a pointer to a field in a struct, the Go
memory in question is the memory occupied by the field, not the entire
struct. When passing a pointer to an element in an array or slice, the
Go memory in question is the entire array or the entire backing array
of the slice.
C code may not keep a copy of a Go pointer after the call returns.
A Go function called by C code may not return a Go pointer. A Go
function called by C code may take C pointers as arguments, and it may
store non-pointer or C pointer data through those pointers, but it may
not store a Go pointer in memory pointed to by a C pointer. A Go
function called by C code may take a Go pointer as an argument, but it
must preserve the property that the Go memory to which it points does
not contain any Go pointers.
Go code may not store a Go pointer in C memory. C code may store Go
pointers in C memory, subject to the rule above: it must stop storing
the Go pointer when the C function returns.
These rules are checked dynamically at runtime. The checking is
controlled by the cgocheck setting of the GODEBUG environment
variable. The default setting is GODEBUG=cgocheck=1, which implements
reasonably cheap dynamic checks. These checks may be disabled entirely
using GODEBUG=cgocheck=0. Complete checking of pointer handling, at
some cost in run time, is available via GODEBUG=cgocheck=2.
It is possible to defeat this enforcement by using the unsafe package,
and of course there is nothing stopping the C code from doing anything
it likes. However, programs that break these rules are likely to fail
in unexpected and unpredictable ways.
"These rules are checked dynamically at runtime."
Benchmarks:
To paraphrase, there are lies, damn lies, and benchmarks.
For valid comparisons across operating systems you need to run on identical hardware, since CPUs, memory, and disk I/O (spinning rust versus silicon) all differ. I dual-boot Linux and Windows on the same machine.
Run benchmarks at least three times back-to-back. Operating systems try to be smart. For example, caching I/O. Languages using virtual machines need warm-up time. And so on.
Know what you are measuring. If you are doing sequential I/O, you spend almost all your time in the operating system. Have you turned off malware protection? And so on.
And so on.
Here are some results for disk.go from the same machine using dual-boot Windows and Linux.
Windows:
>go build disk.go
>/TimeMem disk
Duration : 18.3300322s
Elapsed time : 18.38
Kernel time : 13.71 (74.6%)
User time : 4.62 (25.1%)
Linux:
$ go build disk.go
$ time ./disk
Duration : 18.54350723s
real 0m18.547s
user 0m2.336s
sys 0m16.236s
Effectively, they are the same: 18 seconds of disk.go duration. There is just some variation between operating systems as to what is counted as user time and what is counted as kernel or system time. Elapsed or real time is the same.
In your tests, kernel or system time was 93.72% runtime.cgocall versus 95.49% syscall.Syscall.

Will the following use of strdup() cause a memory leak in C?

char* XX(char* str)
{
    // CONCAT an existing string with str, and return it to the user
}
And I call this function like this:
XX(strdup("CHCHCH"));
Will this cause a leak if what strdup() allocates is never released?
It's unlikely that freeing the result of XX() will do the job.
(Please let me know for both C and C++, thanks!)
Unless the XX function free()'s the argument passed in, yes this will cause a memory leak in both C and C++.
Yes. Something has to free the result of strdup.
You could consider using Boehm's garbage collector, with GC_strdup and GC_malloc instead of strdup and malloc; then you don't need to bother with calling free.
Yes this will leak. strdup's results must be freed.
For C++, on the other hand, I recommend using std::string rather than char*:
std::string XX( std::string const & in )
{
    return in + std::string( "Something to append" );
}
This is a quick-and-dirty way to implement what you're talking about, but it is very readable. You can obtain some speed improvement by passing in a mutable reference to a string for output, but unless this is in a very tight loop there is little reason to do so, as it would likely add complication without boosting performance significantly.

How can I keep memory from exploding when child processes touch variable metadata?

Linux uses COW to keep memory usage low after a fork, but the way Perl 5 variables are implemented in perl seems to defeat this optimization. For instance, for the variable:
my $s = "1";
perl is really storing:
SV = PV(0x100801068) at 0x1008272e8
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x100201d50 "1"\0
CUR = 1
LEN = 16
When you use that string in a numeric context, it modifies the C struct representing the data:
SV = PVIV(0x100821610) at 0x1008272e8
REFCNT = 1
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 1
PV = 0x100201d50 "1"\0
CUR = 1
LEN = 16
The string pointer itself did not change (it is still 0x100201d50), but now it is in a different C struct (a PVIV instead of a PV). I did not modify the value at all, but suddenly I am paying a COW cost. Is there any way to lock the perl representation of a Perl 5 variable so that this time saving (perl doesn't have to convert "0" to 0 a second time) hack doesn't hurt my memory usage?
Note, the representations above were generated from this code:
perl -MDevel::Peek -e '$s = "1"; Dump $s; $s + 0; Dump $s'
The only solution I have found so far is to make sure I force perl to do all of the conversions I expect in the parent process. And as you can see from the code below, even that only helps a little.
Results:
Useless use of addition (+) in void context at z.pl line 34.
Useless use of addition (+) in void context at z.pl line 45.
Useless use of addition (+) in void context at z.pl line 51.
before eating memory
used memory: 71
after eating memory
used memory: 119
after 100 forks that don't reference variable
used memory: 144
after children are reaped
used memory: 93
after 100 forks that touch the variables metadata
used memory: 707
after children are reaped
used memory: 93
after parent has updated the metadata
used memory: 109
after 100 forks that touch the variables metadata
used memory: 443
after children are reaped
used memory: 109
Code:
#!/usr/bin/perl

use strict;
use warnings;

use Parallel::ForkManager;

sub print_mem {
    print @_, "used memory: ", `free -m` =~ m{cache:\s+([0-9]+)}s, "\n";
}

print_mem("before eating memory\n");
my @big = ("1") x (1_024 * 1024);
my $pm = Parallel::ForkManager->new(100);
print_mem("after eating memory\n");

for (1 .. 100) {
    next if $pm->start;
    sleep 2;
    $pm->finish;
}
print_mem("after 100 forks that don't reference variable\n");
$pm->wait_all_children;
print_mem("after children are reaped\n");

for (1 .. 100) {
    next if $pm->start;
    $_ + 0 for @big; # force an update to the metadata
    sleep 2;
    $pm->finish;
}
print_mem("after 100 forks that touch the variables metadata\n");
$pm->wait_all_children;
print_mem("after children are reaped\n");

$_ + 0 for @big; # force an update to the metadata
print_mem("after parent has updated the metadata\n");

for (1 .. 100) {
    next if $pm->start;
    $_ + 0 for @big; # force an update to the metadata
    sleep 2;
    $pm->finish;
}
print_mem("after 100 forks that touch the variables metadata\n");
$pm->wait_all_children;
print_mem("after children are reaped\n");
Anyway, even if you avoid COW at startup and during the run, you should not forget about the END phase of the process lifetime. During shutdown there are two GC phases; in the first one, reference counts are updated, so it can still kill you in the same way. You can solve it in an ugly way:
END { kill 9, $$ }
This goes without saying, but COW doesn't happen on a per-struct basis; it happens per memory page. So it's enough for one thing in an entire memory page to be modified like this for you to pay the copying cost for the whole page.
On Linux you can query the page size like this:
getconf PAGESIZE
On my system that's 4096 bytes. You can fit a lot of Perl scalar structs in that space. If one of them gets modified, Linux will have to copy the entire page.
This is why using memory arenas is a good idea in general. You should separate your mutable and immutable data so that you won't have to pay COW costs for immutable data just because it happened to reside in the same memory page as mutable data.
