Memory leak using CGAL's exact kernel - memory-leaks

I am working on a project that uses OpenMP and the exact kernel of CGAL. As I was running my code in debug mode, and thanks to Visual Studio Leak Detector, I discovered a lot of unexpected memory leaks that are obviously related to the combination of those features. I know that an instance of CGAL::Lazy_exact_nt<CGAL::Gmpq> a.k.a. FT should not be shared between multiple threads, but since it is not the case here, I thought it was safe. Is there any way to fix these leaks ?
Here is a minimal reproducible example:
int main()
{
double xs_h = 1.2;
#pragma omp parallel
{
FT xs;
std::stringstream stream;
stream << xs_h;
stream >> xs;
}
}
And here is (part of) the output of VLD (7 memory leaks in total, all pointing to the same line of code):
---------- Block 26 at 0x00000000FEED2E50: 40 bytes ----------
Leak Hash: 0x6B0C3782, Count: 1, Total 40 bytes
Call Stack (TID 15080):
ucrtbased.dll!malloc()
D:\agent\_work\9\s\src\vctools\crt\vcstartup\src\heap\new_scalar.cpp (35): Test.exe!operator new() + 0xA bytes
D:\CGAL-4.13\include\CGAL\Lazy.h (818): Test.exe!CGAL::Lazy<CGAL::Interval_nt<0>,CGAL::Gmpq,CGAL::Lazy_exact_nt<CGAL::Gmpq>,CGAL::To_interval<CGAL::Gmpq> >::zero() + 0x77 bytes
D:\CGAL-4.13\include\CGAL\Lazy.h (766): Test.exe!CGAL::Lazy<CGAL::Interval_nt<0>,CGAL::Gmpq,CGAL::Lazy_exact_nt<CGAL::Gmpq>,CGAL::To_interval<CGAL::Gmpq> >::Lazy<CGAL::Interval_nt<0>,CGAL::Gmpq,CGAL::Lazy_exact_nt<CGAL::Gmpq>,CGAL::To_interval<CGAL::Gmpq> >() + 0x5 bytes
D:\CGAL-4.13\include\CGAL\Lazy_exact_nt.h (365): Test.exe!CGAL::Lazy_exact_nt<CGAL::Gmpq>::Lazy_exact_nt<CGAL::Gmpq>() + 0x28 bytes
D:\projects\...\src\main.cpp (20): Test.exe!main$omp$1() + 0xA bytes
VCOMP140D.DLL!vcomp_fork() + 0x2E5 bytes
VCOMP140D.DLL!vcomp_fork() + 0x2A1 bytes
VCOMP140D.DLL!vcomp_atomic_div_r8() + 0x20A bytes
KERNEL32.DLL!BaseThreadInitThunk() + 0x14 bytes
ntdll.dll!RtlUserThreadStart() + 0x21 bytes
I am using CGAL 4.13, and my compiler is Visual Studio 2019. I tried recompiling this code with macro CGAL_HAS_THREADS but it does not change the result (memory leaks do not vanish). Thanks for your attention.

Related

NtQueryObject returns wrong insufficient required size via WOW64, why?

I am using the NT native API NtQueryObject()/ZwQueryObject() from user mode (and I am aware of the risks in general and I have written kernel mode drivers for Windows in the past in my professional capacity).
Generally when one uses the typical "query information" function (of which there are a few) the protocol is first to ask with a too small buffer to retrieve the required size with STATUS_INFO_LENGTH_MISMATCH, then allocate a buffer of said size and query again -- this time using the buffer and previously returned size.
In order to get the list of object types (67 on my build) on the system I am doing just that:
ULONG Size = 0;
NTSTATUS Status = NtQueryObject(NULL, ObjectTypesInformation, &Size, sizeof(Size), &Size);
And in Size I get 8280 (WOW64) and 8968 (x64). I then proceed to allocate the buffer with calloc() and query again:
ULONG Size2 = 0;
BYTE* Buf = (BYTE*)::calloc(1, Size);
Status = NtQueryObject(NULL, ObjectTypesInformation, Buf, Size, &Size2);
NB: ObjectTypesInformation is 3. It isn't declared in winternl.h, but Nebbett (as ObjectAllTypesInformation) and others describe it. Since I am not querying for a particular object's traits but the system-wide list of object types, I pass NULL for the object handle.
Curiously on WOW64, i.e. 32-bit, the value in Size2 upon return from the second query is 16 Bytes (= 8296) bigger than the previously returned required size.
As far as alignment is concerned, I'd expect at most 8 Bytes for this sort of thing and indeed neither 8280 nor 8296 are at a 16 Byte alignment boundary, but on an 8 Byte one.
Certainly I can add some slack space on top of the returned required size (e.g. ALIGN_UP to the next 32 Byte alignment boundary), but this seems highly irregular to be honest. And I'd rather want to understand what's going on than to implement a workaround that breaks, because I miss something crucial.
The practical issue for the code is that in Debug configurations it tells me there's a corrupted heap somewhere, upon freeing Buf. Which suggests that NtQueryObject() was indeed writing these extra 16 Bytes beyond the buffer I provided.
Question: Any idea why it is doing that?
As usual for NT native API the sources of information are scarce. The x64 version of the exact same code returns the exact number of bytes required. So my thinking here is that WOW64 is the issue. A somewhat cursory look into wow64.dll with IDA didn't reveal any immediate points for suspicion regarding what goes wrong in translating the results to 32-bit here.
PS: Windows 10 (10.0.19043, ntdll.dll "timestamp" 77755782)
PPS: this may be related: https://wj32.org/wp/2012/11/30/obquerytypeinfo-and-ntqueryobject-buffer-overrun-in-windows-8/ Tested it, by checking that OBJECT_TYPE_INFORMATION::TypeName.Length + sizeof(WCHAR) == OBJECT_TYPE_INFORMATION::TypeName.MaximumLength in all returned items, which was the case.
The only part of ObjectTypesInformation that's public is the first field defined in winternl.h header in the Windows SDK:
typedef struct __PUBLIC_OBJECT_TYPE_INFORMATION {
UNICODE_STRING TypeName;
ULONG Reserved [22]; // reserved for internal use
} PUBLIC_OBJECT_TYPE_INFORMATION, *PPUBLIC_OBJECT_TYPE_INFORMATION;
For x86 this is 96 bytes, and for x64 this is 104 bytes (assuming you have the right packing mode enabled). The difference is the pointer in UNICODE_STRING which changes the alignment in x64.
Any additional memory space should be related to the TypeName buffer.
UNICODE_STRING accounts for 8 bytes of the difference between 8280 and 8296. The function uses the sizeof(ULONG_PTR) for alignment of the returned string plus an extra WCHAR, so that could easily account for the remaining 8 bytes.
AFAIK: The public use of NtQueryObject is supposed to be limited to kernel-mode use which of course means it always matches the OS native bitness (x86 code can't run as kernel in x64 native OS), so it's probably just a quirk of using the NT functions via the WOW64 thunk.
Alright, I think I figured out the issue with the help of WinDbg and a thorough look at wow64.dll using IDA.
NB: the wow64.dll I have has the same build number, but differs slightly in data only (checksum, security directory entry, pieces from version resources). The code is identical, which was to be expected, given deterministic builds and how they affect the PE timestamp.
There's an internal function called whNtQueryObject_SpecialQueryCase (according to PDBs), which covers the ObjectTypesInformation class queries.
For the above wow64.dll I used the following points of interest in WinDbg, from a 32 bit program which calls NtQueryObject(NULL, ObjectTypesInformation, ...) (the program itself is irrelevant, though):
0:000> .load wow64exts
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B0E0
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B14E
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B1A7
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B24A
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B252
Explanation of the above points of interest:
+B0E0: computing length required for 64 bit query, based on passed length for 32 bit
+B14E: call to NtQueryObject()
+B1A7: loop body for copying 64 to 32 bit buffer contents, after successful NtQueryObject() call
+B24A: computing written length by subtracting current (last + 1) entry from base buffer address
+B252: downsizing returned (64 bit) required length to 32 bit
The logic of this function in regards to just ObjectTypesInformation is roughly as follows:
Common steps
Take the ObjectInformationLength (32 bit query!) argument and size it up to fit the 64 bit info
Align the retrieved size up to the next 16 byte boundary
If necessary allocate the resulting amount from some PEB::ProcessHeap and store in TLS slot 3; otherwise using this as a scratch space
Call NtQueryObject() passing the buffer and length from the two previous steps
The length passed to NtQueryObject() is the one from step 1, not the one aligned to a 16 byte boundary. There seems to be some sort of header to this scratch space, so perhaps that's where the 16 byte alignment comes from?
Case 1: buffer size too small (here: 4), just querying required length
The up-sized length in this case equals 4, which is too small and consequently NtQueryObject() returns STATUS_INFO_LENGTH_MISMATCH. Required size is reported as 8968.
Down-size from the 64 bit required length to 32 bit and end up 16 bytes too short
Return the status from NtQueryObject() and the down-sized required length form the previous step
Case 2: buffer size supposedly (!) sufficient
Copy OBJECT_TYPES_INFORMATION::NumberOfTypes from queried buffer to 32 bit one
Step to the first entry (OBJECT_TYPE_INFORMATION) of source (64 bit) and target (32 bit) buffer, 8 and 4 byte aligned respectively
For for each entry up to OBJECT_TYPES_INFORMATION::NumberOfTypes:
Copy UNICODE_STRING::Length and UNICODE_STRING::MaximumLength for TypeName member
memcpy() UNICODE_STRING::Length bytes from source to target UNICODE_STRING::Buffer (target entry + sizeof(OBJECT_TYPE_INFORMATION32)
Add terminating zero (WCHAR) past the memcpy'd string
Copy the individual members past the TypeName from 64 to 32 bit struct
Compute pointer of next entry by aligning UNICODE_STRING::MaximumLength up to an 8 byte boundary (i.e. the ULONG_PTR alignment mentioned in the other answer) + sizeof(OBJECT_TYPE_INFORMATION64) (already 8 byte aligned!)
The next target entry (32 bit) gets 4 byte aligned instead
At the end compute required (32 bit) length by subtracting the value we arrived at for the "next" entry (i.e. one past the last) from the base address of the buffer passed by the WOW64 program (32 bit) to NtQueryObject()
In my debugged scenario these were: 0x008ce050 - 0x008cbfe8 = 0x00002068 (= 8296), which is 16 bytes larger than the buffer length we were told during case 1 (8280)!
The issue
That crucial last step differs between merely querying and actually getting the buffer filled. There is no further bounds checking in that loop I described for case 2.
And this means it will just overrun the passed buffer and return a written length bigger than the buffer length passed to it.
Possible solutions and workarounds
I'll have to approach this mathematically after some sleep, the workaround is obviously to top up the required length returned from case 1 in order to avoid the buffer overrun. The easiest method is to use my up_size_from_32bit() from the example below and use that on the returned required size. This way you are allocating enough for the 64 bit buffer, while querying the 32 bit one. This should never overrun during the copy loop.
However, the fix in wow64.dll is a little more involved, I guess. While adding bounds checking to the loop would help avert the overrun, it would mean that the caller would have to query for the required size twice, because the first time around it lies to us.
Which means the query-only case (1) would have to allocate that internal buffer after querying the required length for 64 bit, then get it filled and then walk the entries (just like the copy loop), skipping over the last entry to compute the required length the same as it is now done after the copy loop.
Example program demonstrating the "static" computation by wow64.dll
Build for x64, just the way wow64.dll was!
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <cstdio>
typedef struct
{
ULONG JustPretending[24];
} OBJECT_TYPE_INFORMATION32;
typedef struct
{
ULONG JustPretending[26];
} OBJECT_TYPE_INFORMATION64;
constexpr ULONG size_delta_3264 = sizeof(OBJECT_TYPE_INFORMATION64) - sizeof(OBJECT_TYPE_INFORMATION32);
constexpr ULONG down_size_to_32bit(ULONG len)
{
return len - size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION64));
}
constexpr ULONG up_size_from_32bit(ULONG len)
{
return len + size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION32));
}
// Trying to mimic the wdm.h macro
constexpr size_t align_up_by(size_t address, size_t alignment)
{
return (address + (alignment - 1)) & ~(alignment - 1);
}
constexpr auto u32 = 8280UL;
constexpr auto u64 = 8968UL;
constexpr auto from_64 = down_size_to_32bit(u64);
constexpr auto from_32 = up_size_from_32bit(u32);
constexpr auto from_32_16_byte_aligned = (ULONG)align_up_by(from_32, 16);
int wmain()
{
wprintf(L"32 to 64 bit: %u -> %u -(16-byte-align)-> %u\n", u32, from_32, from_32_16_byte_aligned);
wprintf(L"64 to 32 bit: %u -> %u\n", u64, from_64);
return 0;
}
static_assert(sizeof(OBJECT_TYPE_INFORMATION32) == 96, "Size for 64 bit struct does not match.");
static_assert(sizeof(OBJECT_TYPE_INFORMATION64) == 104, "Size for 64 bit struct does not match.");
static_assert(u32 == from_64, "Must match (from 64 to 32 bit)");
static_assert(u64 == from_32, "Must match (from 32 to 64 bit)");
static_assert(from_32_16_byte_aligned % 16 == 0, "16 byte alignment failed");
static_assert(from_32_16_byte_aligned > from_32, "We're aligning up");
This does not mimic the computation that happens in case 2, though.

Twice as many page faults when reading from a large malloced array instead of just storing?

I am doing a simple test on monitoring page faults with the code below, What I don't know is how a simple one line of code below doubled my page fault count.
if I use
ptr[i+4096] = 'A'
I got 25,722 page-faults with perf tool, which is what I expected,
but if I use
tmp = ptr[i+4096]
instead, the page-faults doubled to 51,322
I don't how to explain it. Below is the complete code. Thanks!
void do_something() {
int i;
char* ptr;
char tmp;
ptr = malloc(100*1024*1024);
int j = 0;
int k = 0;
for (i = 0; i < 100*1024*1024; i+=4096) {
//ptr[i+4096] = 'A' ;
tmp = ptr[i+4096];
for (j = 0 ; j < 4096; j++)
ptr[i+j] = (char) (i & 0xff); // pagefault
}
free(ptr);
}
int main(int argc, char* argv[]) {
do_something();
return 0;
}
Machine Info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2687W v3 # 3.10GHz
Stepping: 2
CPU MHz: 3096.188
BogoMIPS: 6197.81
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
3.10.0-514.32.3.el7.x86_64 #1
malloc() will often satisfy requests for memory by asking the OS for new pages, e.g., via mmap. Such pages are generally allocated lazily: no actual page is allocated until the first access.
What happens then depends on the type of the first access: when you do a read first, Linux will map in a shared read-only COW page of zeros to satisfy it, and then if you later you write it takes a second fault to allocate the private writeable page.
When you just do the write first, the first step is skipped. That's the usual case since code generally isn't reading from newly allocated memory which has undefined contents (at least when you get it from malloc).
Note that the above is a description of how newly allocated pages work in Linux - when you use malloc there is another layer: malloc will generally try to satisfy requests for blocks the process freed earlier, rather than continually requesting new memory. In the case memory is re-used, it will generally already be paged in and the above won't apply. Of course for your initial big allocation of 1024 MiB, where is no memory to re-use so you can be sure the allocator is getting it from the OS.

Assertion in appcore.cpp while loading regular DLL dynamically linked to MFC

I have inherited an application which consists of a regular DLL which is dynamically linked to MFC and which is loaded from a Windows service executable which also links dynamically to MFC. The code is being compiled using Microsoft Visual Studio 2008 Professional (old, I know...). This application has been 'working' for several years but I have found that I cannot run it as Debug build due to the following assertion in appcore.cpp:
Debug Assertion Failed!
Program: C:\Projects\CMM\Debug\CMM.exe
File: f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\appcore.cpp
Line: 380
For more information on how your program can cause an assertion
failure, see the Visual C++ documentation on asserts.
(Press Retry to debug the application)
which corresponds to the following code in the CWinApp constructor:
ASSERT(AfxGetThread() == NULL);
pThreadState->m_pCurrentWinThread = this;
ASSERT(AfxGetThread() == this);
This occurs when loading the DLL via LoadLibrary and leads me to suspect that the application has worked more by luck than judgement over the years (due to ASSERT not being included as part of the Release build).
My (admittedly limited) understanding of MFC is that although there should generally only be a single instance of CWinApp, it is permissible to have an additional one in regular DLLs which link dynamically to MFC, as in this case. The code has one instance in the service executable and one in the DLL. The CWinApp constructor gets called three (?) times, once for some internal instance within the MFC framework, once for the instance in the service executable and once for the instance in the DLL. The first two work fine, it is the third which blows up.
All of the exported functions start with AFX_MANAGE_STATE (although execution never gets that far) and the pre-processor flags are, I believe, correct w.r.t. Microsoft's documentation (_AFXDLL for the EXE, _AFXDLL, _USRDLL and _WINDLL for the DLL).
I have tried using AfxLoadLibrary instead of LoadLibrary to no effect. However, if I include
AFX_MANAGE_STATE( AfxGetStaticModuleState() )
at the start of the function which calls LoadLibrary/AfxLoadLibrary, the CWinApp object is actually constructed but execution then blows up in dllmodul.cpp instead.
Can anybody shed any light on why this might be happening or what I need to do to fix it?
EDIT
This is the call stack when the assertion occurs:
mfc90d.dll!CWinApp::CWinApp(const char * lpszAppName=0x00000000) Line 380 + 0x1c bytes C++
cimdll.dll!CCimDllApp::CCimDllApp() Line 146 + 0x19 bytes C++
cimdll.dll!`dynamic initializer for 'theApp''() Line 129 + 0xd bytes C++
msvcr90d.dll!_initterm(void (void)* * pfbegin=0x1b887c88, void (void)* * pfend=0x1b887d6c) Line 903 C
cimdll.dll!_CRT_INIT(void * hDllHandle=0x1b770000, unsigned long dwReason=1, void * lpreserved=0x00000000) Line 318 + 0xf bytes C
cimdll.dll!__DllMainCRTStartup(void * hDllHandle=0x1b770000, unsigned long dwReason=1, void * lpreserved=0x00000000) Line 540 + 0x11 bytes C
cimdll.dll!_DllMainCRTStartup(void * hDllHandle=0x1b770000, unsigned long dwReason=1, void * lpreserved=0x00000000) Line 510 + 0x11 bytes C
ntdll.dll!_LdrxCallInitRoutine#16() + 0x16 bytes
ntdll.dll!LdrpCallInitRoutine() + 0x43 bytes
ntdll.dll!LdrpInitializeNode() + 0x101 bytes
ntdll.dll!LdrpInitializeGraphRecurse() + 0x71 bytes
ntdll.dll!LdrpPrepareModuleForExecution() + 0x8b bytes
ntdll.dll!LdrpLoadDllInternal() + 0x121 bytes
ntdll.dll!LdrpLoadDll() + 0x92 bytes
ntdll.dll!_LdrLoadDll#16() + 0xd9 bytes
KernelBase.dll!_LoadLibraryExW#12() + 0x138 bytes
KernelBase.dll!_LoadLibraryExA#12() + 0x26 bytes
KernelBase.dll!_LoadLibraryA#4() + 0x32 bytes
mfc90d.dll!AfxCtxLoadLibraryA(const char * lpLibFileName=0x02a70ce0) Line 487 + 0x74 bytes C++
mfc90d.dll!AfxLoadLibrary(const char * lpszModuleName=0x02a70ce0) Line 193 + 0x9 bytes C++
CMM.exe!CMonDevDll::LoadDLL() Line 207 + 0x1b bytes C++
CMM.exe!CMonDevDll::LoadDllEntryPoints() Line 268 + 0x8 bytes C++
CMM.exe!CMonDevDll::Initialize(CMonDevRun * pMonDevRun=0x0019fe60) Line 186 + 0x8 bytes C++
CMM.exe!CCtcLinkMonDev::Initialize(CMonDevRun * pMonDevRun=0x0019fe60, CCtcRegistry & reg={...}, int nLinkId=1) Line 546 + 0x18 bytes C++
CMM.exe!CCtcLinkSwitchMgr::Initialize(CMonDevRun * pMonDevRun=0x0019fe60, CCtcRegistry & reg={...}) Line 188 + 0x14 bytes C++
CMM.exe!CMonDevRun::Initialize(ATL::CStringT<char,StrTraitMFC_DLL<char,ATL::ChTraitsCRT<char> > > szServiceName="CimService") Line 257 + 0x16 bytes C++
CMM.exe!CMonDevService::Run() Line 202 + 0x2d bytes C++
CommonFilesD.dll!CCtcServiceBase::ParseStandardArgs(int argc=-1, char * * argv=0x02a51b44) Line 278 + 0xf bytes C++
CMM.exe!main(int argc=4, char * * argv=0x02a51b38) Line 126 + 0x16 bytes C++
CMM.exe!__tmainCRTStartup() Line 586 + 0x19 bytes C
CMM.exe!mainCRTStartup() Line 403 C
kernel32.dll!#BaseThreadInitThunk#12() + 0x24 bytes
ntdll.dll!__RtlUserThreadStart() + 0x2f bytes
ntdll.dll!__RtlUserThreadStart#8() + 0x1b bytes
I finally managed to track down the cause of my crash. A library used by my DLL was linking statically to Boost.Thread and causing this issue, presumably due to a runtime mismatch. Changing the library to link dynamically to Boost instead seems to have fixed the problem.

Golang : fatal error: runtime: out of memory

I trying to use this package in Github for string matching. My dictionary is 4 MB. When creating the Trie, I got fatal error: runtime: out of memory. I am using Ubuntu 14.04 with 8 GB of RAM and Golang version 1.4.2.
It seems the error come from the line 99 (now) here : m.trie = make([]node, max)
The program stops at this line.
This is the error:
fatal error: runtime: out of memory
runtime stack:
runtime.SysMap(0xc209cd0000, 0x3b1bc0000, 0x570a00, 0x5783f8)
/usr/local/go/src/runtime/mem_linux.c:149 +0x98
runtime.MHeap_SysAlloc(0x57dae0, 0x3b1bc0000, 0x4296f2)
/usr/local/go/src/runtime/malloc.c:284 +0x124
runtime.MHeap_Alloc(0x57dae0, 0x1d8dda, 0x10100000000, 0x8)
/usr/local/go/src/runtime/mheap.c:240 +0x66
goroutine 1 [running]:
runtime.switchtoM()
/usr/local/go/src/runtime/asm_amd64.s:198 fp=0xc208518a60 sp=0xc208518a58
runtime.mallocgc(0x3b1bb25f0, 0x4d7fc0, 0x0, 0xc20803c0d0)
/usr/local/go/src/runtime/malloc.go:199 +0x9f3 fp=0xc208518b10 sp=0xc208518a60
runtime.newarray(0x4d7fc0, 0x3a164e, 0x1)
/usr/local/go/src/runtime/malloc.go:365 +0xc1 fp=0xc208518b48 sp=0xc208518b10
runtime.makeslice(0x4a52a0, 0x3a164e, 0x3a164e, 0x0, 0x0, 0x0)
/usr/local/go/src/runtime/slice.go:32 +0x15c fp=0xc208518b90 sp=0xc208518b48
github.com/mf/ahocorasick.(*Matcher).buildTrie(0xc2083c7e60, 0xc209860000, 0x26afb, 0x2f555)
/home/go/ahocorasick/ahocorasick.go:104 +0x28b fp=0xc208518d90 sp=0xc208518b90
github.com/mf/ahocorasick.NewStringMatcher(0xc208bd0000, 0x26afb, 0x2d600, 0x8)
/home/go/ahocorasick/ahocorasick.go:222 +0x34b fp=0xc208518ec0 sp=0xc208518d90
main.main()
/home/go/seme/substrings.go:66 +0x257 fp=0xc208518f98 sp=0xc208518ec0
runtime.main()
/usr/local/go/src/runtime/proc.go:63 +0xf3 fp=0xc208518fe0 sp=0xc208518f98
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2232 +0x1 fp=0xc208518fe8 sp=0xc208518fe0
exit status 2
This is the content of the main function (taken from the same repo: test file)
var dictionary = InitDictionary()
var bytes = []byte(""Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney) says they will pay this on the usual credit terms (30 days).")
var precomputed = ahocorasick.NewStringMatcher(dictionary)// line 66 here
fmt.Println(precomputed.Match(bytes))
Your structure is awfully inefficient in terms of memory, let's look at the internals. But before that, a quick reminder of the space required for some go types:
bool: 1 byte
int: 4 bytes
uintptr: 4 bytes
[N]type: N*sizeof(type)
[]type: 12 + len(slice)*sizeof(type)
Now, let's have a look at your structure:
type node struct {
root bool // 1 byte
b []byte // 12 + len(slice)*1
output bool // 1 byte
index int // 4 bytes
counter int // 4 bytes
child [256]*node // 256*4 = 1024 bytes
fails [256]*node // 256*4 = 1024 bytes
suffix *node // 4 bytes
fail *node // 4 bytes
}
Ok, you should have a guess of what happens here: each node weighs more than 2KB, this is huge ! Finally, we'll look at the code that you use to initialize your trie:
func (m *Matcher) buildTrie(dictionary [][]byte) {
max := 1
for _, blice := range dictionary {
max += len(blice)
}
m.trie = make([]node, max)
// ...
}
You said your dictionary is 4 MB. If it is 4MB in total, then it means that at the end of the for loop, max = 4MB. It it holds 4 MB different words, then max = 4MB*avg(word_length).
We'll take the first scenario, the nicest one. You are initializing a slice of 4M of nodes, each of which uses 2KB. Yup, that makes a nice 8GB necessary.
You should review how you build your trie. From the wikipedia page related to the Aho-Corasick algorithm, each node contains one character, so there is at most 256 characters that go from the root, not 4MB.
Some material to make it right: https://web.archive.org/web/20160315124629/http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf
The node type has a memory size of 2084 bytes.
I wrote a litte program to demonstrate the memory usage: https://play.golang.org/p/szm7AirsDB
As you can see, the three strings (11(+1) bytes in size) dictionary := []string{"fizz", "buzz", "123"} require 24 MB of memory.
If your dictionary has a length of 4 MB you would need about 4000 * 2084 = 8.1 GB of memory.
So you should try to decrease the size of your dictionary.
Set resource limit to unlimited worked for me
if ulimit -a return 0 run ulimit -c unlimited
Maybe set a real size limit to be more secure

Can malloc_trim() release memory from the middle of the heap?

I am confused about the behaviour of malloc_trim as implemented in the glibc.
man malloc_trim
[...]
malloc_trim - release free memory from the top of the heap
[...]
This function cannot release free memory located at places other than the top of the heap.
When I now look up the source of malloc_trim() (in malloc/malloc.c) I see that it calls mtrim() which is utilizing madvise(x, MADV_DONTNEED) to release memory back to the operating system.
So I wonder if the man-page is wrong or if I misinterpret the source in malloc/malloc.c.
Can malloc_trim() release memory from the middle of the heap?
There are two usages of madvise with MADV_DONTNEED in glibc now: http://code.metager.de/source/search?q=MADV_DONTNEED&path=%2Fgnu%2Fglibc%2Fmalloc%2F&project=gnu
H A D arena.c 643 __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
H A D malloc.c 4535 __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
There was https://sourceware.org/git/?p=glibc.git;a=commit;f=malloc/malloc.c;h=68631c8eb92ff38d9da1ae34f6aa048539b199cc commit by Ulrich Drepper on 16 Dec 2007 (part of glibc 2.9 and newer):
malloc/malloc.c (public_mTRIm): Iterate over all arenas and call
mTRIm for all of them.
(mTRIm): Additionally iterate over all free blocks and use madvise
to free memory for all those blocks which contain at least one
memory page.
mTRIm (now mtrim) implementation was changed. Unused parts of chunks, aligned on page size and having size more than page may be marked as MADV_DONTNEED:
/* See whether the chunk contains at least one unused page. */
char *paligned_mem = (char *) (((uintptr_t) p
+ sizeof (struct malloc_chunk)
+ psm1) & ~psm1);
assert ((char *) chunk2mem (p) + 4 * SIZE_SZ <= paligned_mem);
assert ((char *) p + size > paligned_mem);
/* This is the size we could potentially free. */
size -= paligned_mem - (char *) p;
if (size > psm1)
madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
Man page of malloc_trim is there: https://github.com/mkerrisk/man-pages/blob/master/man3/malloc_trim.3 and it was committed by kerrisk in 2012: https://github.com/mkerrisk/man-pages/commit/a15b0e60b297e29c825b7417582a33e6ca26bf65
As I can grep the glibc's git, there are no man pages in the glibc, and no commit to malloc_trim manpage to document this patch. The best and the only documentation of glibc malloc is its source code: https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c
Additional functions:
malloc_trim(size_t pad);
609 /*
610 malloc_trim(size_t pad);
611
612 If possible, gives memory back to the system (via negative
613 arguments to sbrk) if there is unused memory at the `high' end of
614 the malloc pool. You can call this after freeing large blocks of
615 memory to potentially reduce the system-level memory requirements
616 of a program. However, it cannot guarantee to reduce memory. Under
617 some allocation patterns, some large free blocks of memory will be
618 locked between two used chunks, so they cannot be given back to
619 the system.
620
621 The `pad' argument to malloc_trim represents the amount of free
622 trailing space to leave untrimmed. If this argument is zero,
623 only the minimum amount of memory to maintain internal data
624 structures will be left (one page or less). Non-zero arguments
625 can be supplied to maintain enough trailing space to service
626 future expected allocations without having to re-obtain memory
627 from the system.
628
629 Malloc_trim returns 1 if it actually released any memory, else 0.
630 On systems that do not support "negative sbrks", it will always
631 return 0.
632 */
633 int __malloc_trim(size_t);
634
Freeing from the middle of the chunk is not documented as text in malloc/malloc.c (and malloc_trim description in commend was not updated in 2007) and not documented in man-pages project. Man page from 2012 may be the first man page of the function, written not by authors of glibc. Info page of glibc only mentions M_TRIM_THRESHOLD of 128 KB:
https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html#Malloc-Tunable-Parameters and don't list malloc_trim function https://www.gnu.org/software/libc/manual/html_node/Summary-of-Malloc.html#Summary-of-Malloc (and it also don't document memusage/memusagestat/libmemusage.so).
You may ask Drepper and other glibc developers again as you already did in https://sourceware.org/ml/libc-help/2015-02/msg00022.html "malloc_trim() behaviour", but there is still no reply from them. (Only wrong answers from other users like https://sourceware.org/ml/libc-help/2015-05/msg00007.html https://sourceware.org/ml/libc-help/2015-05/msg00008.html)
Or you may test the malloc_trim with this simple C program (test_malloc_trim.c) and strace/ltrace:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <malloc.h>
int main()
{
int *m1,*m2,*m3,*m4;
printf("%s\n","Test started");
m1=(int*)malloc(20000);
m2=(int*)malloc(40000);
m3=(int*)malloc(80000);
m4=(int*)malloc(10000);
printf("1:%p 2:%p 3:%p 4:%p\n", m1, m2, m3, m4);
free(m2);
malloc_trim(0); // 20000, 2000000
sleep(1);
free(m1);
free(m3);
free(m4);
// malloc_stats(); malloc_info(0, stdout);
return 0;
}
gcc test_malloc_trim.c -o test_malloc_trim, strace ./test_malloc_trim
write(1, "Test started\n", 13Test started
) = 13
brk(0) = 0xcca000
brk(0xcef000) = 0xcef000
write(1, "1:0xcca010 2:0xccee40 3:0xcd8a90"..., 441:0xcca010 2:0xccee40 3:0xcd8a90 4:0xcec320
) = 44
madvise(0xccf000, 36864, MADV_DONTNEED) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffffafbfff0) = 0
brk(0xceb000) = 0xceb000
So, there is madvise with MADV_DONTNEED for 9 pages after malloc_trim(0) call, when there was hole of 40008 bytes in the middle of the heap.
... utilizing madvise(x, MADV_DONTNEED) to release memory back to the
operating system.
madvise(x, MADV_DONTNEED) does not release memory. man madvise:
MADV_DONTNEED
Do not expect access in the near future. (For the time being,
the application is finished with the given range, so the kernel
can free resources associated with it.) Subsequent accesses of
pages in this range will succeed, but will result either in
reloading of the memory contents from the underlying mapped file
(see mmap(2)) or zero-fill-on-demand pages for mappings without
an underlying file.
So, the usage of madvise(x, MADV_DONTNEED) does not contradict man malloc_trim's statement:
This function cannot release free memory located at places other than the top of the heap.

Resources