I am using Ray to parallelize some computations, but it seems to be accumulating spilled objects.
I don't mind it spilling objects to my hard drive, but I do mind if it means using 130+ GiB to process about 1.6 GiB of simulations.
Below is a trace of what is happening:
Number of steps: 55 (9,091 simulations each)
0%
(raylet) Spilled 3702 MiB, 12 objects, write throughput 661 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 5542 MiB, 17 objects, write throughput 737 MiB/s.
2%
(raylet) Spilled 9883 MiB, 33 objects, write throughput 849 MiB/s.
5%
(raylet) Spilled 16704 MiB, 58 objects, write throughput 997 MiB/s.
13%
(raylet) Spilled 32903 MiB, 124 objects, write throughput 784 MiB/s.
29%
(raylet) Spilled 66027 MiB, 268 objects, write throughput 661 MiB/s.
53%
(raylet) Spilled 131920 MiB, 524 objects, write throughput 461 MiB/s.
60%
And here is the code I am running:
def get_res_parallel(simulations, num_loads=num_cpus):
    load_size = simulations.shape[0] / num_loads
    simulations_per_load = [simulations[round(n * load_size): round((n+1) * load_size)]
                            for n in range(num_loads)]
    # 2D numpy arrays
    results = ray.get([get_res_ray.remote(simulations=simulations)
                       for simulations in simulations_per_load])
    return np.vstack(results)

MAX_RAM = 6 * 2**30  # 6 GiB

def get_expected_res(simulations, MAX_RAM=MAX_RAM):
    expected_result = np.zeros(shape=87_381, dtype=np.float64)
    bytes_per_res = len(expected_result) * (64 // 8)
    num_steps = simulations.shape[0] * bytes_per_res // MAX_RAM + 1
    step_size = simulations.shape[0] / num_steps
    print(f"Number of steps: {num_steps} ({step_size:,.0f} simulations each)")
    for n in range(num_steps):
        print(f"\r{n / num_steps:.0%}", end="")
        step_simulations = simulations[round(n * step_size): round((n+1) * step_size)]
        results = get_res_parallel(simulations=step_simulations)
        expected_result += results.mean(axis=0)
    print(f"\r100%")
    return expected_result / num_steps
Running on a Mac M1 with 16 GiB of RAM, Ray 2.0.0 and Python 3.9.13.
Question
Given my code, is it normal behavior?
What can I do to resolve this problem? Force garbage collection?
Do you know the expected size of the array returned by get_res_ray?
Ray will spill objects returned by remote tasks as well as objects passed to remote tasks, so in this case there are two possible places that can cause memory pressure:
The ObjectRefs returned by get_res_ray.remote
The simulations passed to get_res_ray.remote. Since these are large, Ray will automatically put these in the local object store to reduce the size of the task definition.
It may be expected to spill if the size of these objects combined is greater than 30% of the RAM on your machine (this is the default size of Ray's object store). It's not suggested to increase the size of the object store, since this can cause memory pressure on the functions instead.
But you can try to either process fewer things in each iteration and/or you can try to release ObjectRefs sooner. In particular, you should try to release the results from the previous iteration as soon as possible, so that Ray can GC the objects for you. You can do this by calling del results once you're done using them.
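To see why spilling is plausible here, the sizes can be worked out from the numbers in the question. This is only a rough sketch, not a measurement: the total of about 500,000 simulations is an assumption inferred from the trace (55 steps of about 9,091 simulations each), and the 30% object store share is the default mentioned above.

```python
GIB = 2**30

result_len = 87_381                  # length of one result row (from the question)
bytes_per_res = result_len * 8       # float64 -> 8 bytes per element
max_ram = 6 * GIB                    # MAX_RAM from the question
num_simulations = 500_000            # ASSUMPTION: inferred from "55 steps, 9,091 each"

# Same step arithmetic as get_expected_res in the question.
num_steps = num_simulations * bytes_per_res // max_ram + 1
step_size = num_simulations / num_steps

# Size of all result rows produced by a single step.
results_per_step_gib = step_size * bytes_per_res / GIB

# Default object store size as described above: about 30% of a 16 GiB machine.
object_store_gib = 0.30 * 16

print(f"steps: {num_steps}, simulations per step: {step_size:,.0f}")
print(f"results per step: {results_per_step_gib:.2f} GiB")
print(f"object store:     {object_store_gib:.2f} GiB")
```

Under these assumptions each step produces close to 6 GiB of result arrays, which is more than the roughly 4.8 GiB object store, so some spilling per step is expected even before results from earlier steps are taken into account.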
Here's a full suggestion that will do the same thing by feeding the array results into another task instead of getting them on the driver. This is usually a better approach because it avoids adding memory pressure on the driver and you're less likely to be accidentally pinning results in the driver's memory.
@ray.remote
def mean(*arrays):
    return np.vstack(arrays).mean(axis=0)

def get_res_parallel(simulations, num_loads=num_cpus):
    load_size = simulations.shape[0] / num_loads
    simulations_per_load = [simulations[round(n * load_size): round((n+1) * load_size)]
                            for n in range(num_loads)]
    # 2D numpy arrays
    # Use the * syntax in Python to unpack the ObjectRefs as function arguments.
    result = mean.remote(*[get_res_ray.remote(simulations=simulations)
                           for simulations in simulations_per_load])
    # We never have the result arrays stored in driver's memory.
    return ray.get(result)

MAX_RAM = 6 * 2**30  # 6 GiB

def get_expected_res(simulations, MAX_RAM=MAX_RAM):
    expected_result = np.zeros(shape=87_381, dtype=np.float64)
    bytes_per_res = len(expected_result) * (64 // 8)
    num_steps = simulations.shape[0] * bytes_per_res // MAX_RAM + 1
    step_size = simulations.shape[0] / num_steps
    print(f"Number of steps: {num_steps} ({step_size:,.0f} simulations each)")
    for n in range(num_steps):
        print(f"\r{n / num_steps:.0%}", end="")
        step_simulations = simulations[round(n * step_size): round((n+1) * step_size)]
        expected_result += get_res_parallel(simulations=step_simulations)
    print(f"\r100%")
    return expected_result / num_steps
I have a question that came up while learning PyTorch.
Reference: https://pytorch.org/docs/stable/cuda.html
current_memory = torch.cuda.memory_allocated(device=device)  # GPU memory currently occupied by tensors, in bytes
unused, total = torch.cuda.mem_get_info(device=device)  # returns (free, total) GPU memory, in bytes
used_memory = total - unused
In my code,
total: 11.77 GB
used_memory: 4.63 GB
current_memory: 3.1 GB
I wonder whether used_memory - current_memory is leaked memory.
Why do used_memory and current_memory have different values?
Thanks for any help.
What is the best and fastest way to iterate over Collection objects in Groovy? I know there are several Groovy collection utility methods, but they use closures, which are slow.
The final result in your specific case might be different, but benchmarking 5 different iteration variants available in Groovy shows that the old Java for-each loop is the most efficient one. Take a look at the following example, where we iterate over 100 million elements and calculate the total sum of these numbers in a very imperative way:
@Grab(group='org.gperfutils', module='gbench', version='0.4.3-groovy-2.4')

import java.util.concurrent.atomic.AtomicLong
import java.util.function.Consumer

def numbers = (1..100_000_000)

def r = benchmark {
    'numbers.each {}' {
        final AtomicLong result = new AtomicLong()
        numbers.each { number -> result.addAndGet(number) }
    }
    'for (int i = 0 ...)' {
        final AtomicLong result = new AtomicLong()
        for (int i = 0; i < numbers.size(); i++) {
            result.addAndGet(numbers[i])
        }
    }
    'for-each' {
        final AtomicLong result = new AtomicLong()
        for (int number : numbers) {
            result.addAndGet(number)
        }
    }
    'stream + closure' {
        final AtomicLong result = new AtomicLong()
        numbers.stream().forEach { number -> result.addAndGet(number) }
    }
    'stream + anonymous class' {
        final AtomicLong result = new AtomicLong()
        numbers.stream().forEach(new Consumer<Integer>() {
            @Override
            void accept(Integer number) {
                result.addAndGet(number)
            }
        })
    }
}
r.prettyPrint()
This is just a simple example where we try to benchmark the cost of iteration over a collection, no matter what the operation executed for every element from collection is (all variants use the same operation to give the most accurate results). And here are results (time measurements are expressed in nanoseconds):
Environment
===========
* Groovy: 2.4.12
* JVM: OpenJDK 64-Bit Server VM (25.181-b15, Oracle Corporation)
* JRE: 1.8.0_181
* Total Memory: 236 MB
* Maximum Memory: 3497 MB
* OS: Linux (4.18.9-100.fc27.x86_64, amd64)
Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On
WARNING: Timed out waiting for "numbers.each {}" to be stable
                          user        system    cpu         real
numbers.each {}           7139971394  11352278  7151323672  7246652176
for (int i = 0 ...)       6349924690   5159703  6355084393  6447856898
for-each                  3449977333    826138  3450803471  3497716359
stream + closure          8199975894    193599  8200169493  8307968464
stream + anonymous class  3599977808   3218956  3603196764  3653224857
Conclusion
Java's for-each is as fast as Stream + anonymous class (Groovy 2.x does not allow using lambda expressions).
The old for (int i = 0; ... loop is almost twice as slow as for-each, most probably because of the additional effort of looking up the value at a given index.
Groovy's each method is a little faster than the stream + closure variant, and both are more than twice as slow as the fastest one.
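The relative slowdowns quoted in this conclusion can be checked directly from the cpu column of the 100 million element run. The snippet below is just that arithmetic, written in Python for convenience; the numbers are copied from the benchmark output above and nothing is re-measured.

```python
# CPU time in nanoseconds, copied from the "cpu" column of the benchmark output.
cpu_ns = {
    "numbers.each {}": 7_151_323_672,
    "for (int i = 0 ...)": 6_355_084_393,
    "for-each": 3_450_803_471,
    "stream + closure": 8_200_169_493,
    "stream + anonymous class": 3_603_196_764,
}

# Express every variant as a multiple of the fastest one.
fastest = min(cpu_ns.values())
ratios = {name: t / fastest for name, t in cpu_ns.items()}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:26s} {ratio:.2f}x")
```

The indexed for loop comes out at roughly 1.8 times the for-each time, while each {} and stream + closure are both above 2 times.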
It's important to run benchmarks for your specific use case to get the most accurate answer. For instance, the Stream API will most probably be the best choice if other operations are applied alongside the iteration (filtering, mapping, etc.). For simple iterations from the first to the last element of a given collection, choosing the old Java for-each might give the best results, because it does not produce much overhead.
Also, the size of the collection matters. For instance, if we use the above example but iterate over 100k elements instead of 100 million, the slowest variant costs 0.82 ms versus 0.38 ms for the fastest. If you build a system where every nanosecond matters, then you have to pick the most efficient solution. But if you build a simple CRUD application, then it doesn't matter whether iterating over a collection takes 0.82 or 0.38 milliseconds: the cost of a database connection is at least 50 times bigger, so saving approximately 0.44 milliseconds would not make any impact.
// Results for iterating over 100k elements
Environment
===========
* Groovy: 2.4.12
* JVM: OpenJDK 64-Bit Server VM (25.181-b15, Oracle Corporation)
* JRE: 1.8.0_181
* Total Memory: 236 MB
* Maximum Memory: 3497 MB
* OS: Linux (4.18.9-100.fc27.x86_64, amd64)
Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On
                          user    system  cpu     real
numbers.each {}           717422       0  717422  722944
for (int i = 0 ...)       593016       0  593016  600860
for-each                  381976       0  381976  387252
stream + closure          811506    5884  817390  827333
stream + anonymous class  408662    1183  409845  416381
UPDATE: Dynamic invocation vs static compilation
There is one more factor worth taking into account: static compilation. Below you can find the results of a benchmark that iterates over a collection of 10 million elements:
Environment
===========
* Groovy: 2.4.12
* JVM: OpenJDK 64-Bit Server VM (25.181-b15, Oracle Corporation)
* JRE: 1.8.0_181
* Total Memory: 236 MB
* Maximum Memory: 3497 MB
* OS: Linux (4.18.10-100.fc27.x86_64, amd64)
Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On
                              user       system   cpu        real
Dynamic each {}               727357070        0  727357070  731017063
Static each {}                141425428   344969  141770397  143447395
Dynamic for-each              369991296   619640  370610936  375825211
Static for-each                92998379    27666   93026045   93904478
Dynamic for (int i = 0; ...)  679991895  1492518  681484413  690961227
Static for (int i = 0; ...)   173188913        0  173188913  175396602
As you can see, turning on static compilation (with the @CompileStatic class annotation, for instance) is a game changer. Java for-each is still the most efficient, but its static variant is almost 4 times faster than the dynamic one. Static Groovy each {} is 5 times faster than the dynamic each {}, and the static for loop is also 4 times faster than the dynamic for loop.
Conclusion: for 10 million elements, static numbers.each {} takes 143 milliseconds while static for-each takes 93 milliseconds for the same size collection. It means that for a collection of size 100k, static numbers.each {} will cost approximately 0.14 ms and static for-each approximately 0.09 ms. Both are very fast, and the real difference shows up when the size of the collection explodes to 100+ million elements.
Java stream from Java compiled class
And to give you some perspective, here is a Java class with stream().forEach() on 10 million elements for comparison:
Java stream.forEach() 87271350 160988 87432338 88563305
Just a little bit faster than statically compiled for-each in Groovy code.
I'm trying to create a zram device on my target device. My target cannot allocate memory if the zram disksize is above 100GB, but it's okay with a disksize of 50GB or less.
Is there any limit on the zram disksize on Linux? My target device only has 2GB of RAM.
I guess you can give a number up to UINT64_MAX - 4095 = 18446744073709547520 on a 64-bit platform.
https://github.com/torvalds/linux/blob/master/drivers/block/zram/zram_drv.h#L101
https://github.com/torvalds/linux/blob/master/drivers/block/zram/zram_drv.c#L1506
https://github.com/torvalds/linux/blob/master/drivers/block/zram/zram_drv.c#L901
So what we have:
... disksize_store(...) {
    u64 disksize;
    ...
    // ok, we can give up to UINT64_MAX here.
    disksize = unsigned long long memparse(...);
    // PAGE_ALIGN, PAGE_SIZE = 1<<12
    disksize = PAGE_ALIGN(disksize)
             = (((disksize)+((PAGE_SIZE)-1))&(~((typeof(disksize))(PAGE_SIZE)-1)))
             = (disksize + ((1<<12)-1))&(~((1<<12)-1))
             = (disksize + 4095) & 0xfffffffffffff000
    //          ^^^^^^^^^^^^^^^ this can overflow
    // so max number is UINT64_MAX - 4095 so it doesn't overflow
    // otherwise this macro will return 0
    ...
    if (!zram_meta_alloc(..., disksize)) {
        ...
        return ...;
    }
    ...
    zram->disksize = disksize;
    ...
}
So let's look into zram_meta_alloc:
... zram_meta_alloc(..., disksize) {
    ...
    num_pages = disksize >> PAGE_SHIFT;
    // max num_pages = 0xfffffffffffff = UINT64_MAX >> PAGE_SHIFT
    ... = vzalloc(num_pages * sizeof(*zram->table));
    //            ^^^^^^^^^^^^^^^ this can overflow
    ...
}
vzalloc takes an unsigned long as its argument. ULONG_MAX should be UINT64_MAX on a 64-bit platform. sizeof(*zram->table) is equal to sizeof(unsigned long) + sizeof(unsigned long) [+ sizeof(ktime_t), optional] + padding. Without padding, and assuming a 64-bit platform where sizeof(unsigned long) = 8, that should be equal to 8+8[+8] = 16 or 24. But anyway, the maximum num_pages is equal to UINT64_MAX >> 12, so to overflow the 64-bit multiplication we would need sizeof(*zram->table) to exceed 2^PAGE_SHIFT = 4096, and that shouldn't happen (unless the compiler decides to put over 4000 bytes of padding into the zram->table struct).
So the maximum disksize is UINT64_MAX - 4095. If you give a disksize equal to UINT64_MAX - x, where 0 <= x < 4095, then because of the PAGE_ALIGN macro the disksize will effectively be set to 0. This should probably be brought up with the kernel developers, so that they can modify the PAGE_ALIGN macro to support such numbers.
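The wrap-around described above can be reproduced outside the kernel. The following is a small Python sketch that emulates PAGE_ALIGN on a 64-bit unsigned value, assuming PAGE_SIZE = 1<<12 as in the walkthrough. The function name page_align is only illustrative, and Python integers are masked by hand to mimic u64 arithmetic.

```python
UINT64_MAX = 2**64 - 1
PAGE_SIZE = 1 << 12  # assumption: 4 KiB pages, as in the walkthrough above

def page_align(size):
    """Emulate PAGE_ALIGN on a u64: add PAGE_SIZE - 1, wrap at 64 bits, clear the low 12 bits."""
    wrapped_sum = (size + PAGE_SIZE - 1) & UINT64_MAX   # the addition may wrap around
    return wrapped_sum & (UINT64_MAX ^ (PAGE_SIZE - 1))  # mask 0xfffffffffffff000

largest_ok = UINT64_MAX - 4095

print(page_align(largest_ok) == largest_ok)  # True: already page aligned, no wrap
print(page_align(largest_ok + 1))            # 0: the addition wraps around
print(page_align(UINT64_MAX))                # 0: same wrap-around
```

Any value above UINT64_MAX - 4095 wraps during the addition and ends up as 0 after masking, which matches the behaviour described for the macro.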
6 days ago, a call to array_size was added to the vzalloc call to protect against overflow with this commit.
There is no limit but there is an overhead.
"Note that zram uses about 0.1% of the size of the disk when not in use so a huge zram is wasteful."
https://www.kernel.org/doc/Documentation/blockdev/zram.txt
Also, disksize is a virtual size: how much memory is actually used depends purely on the input and on the compression ratio achieved by the chosen algorithm. disksize is the maximum uncompressed size, along with the general disk parameters.
The only 'actual' control is mem_limit, which covers the compressed size plus the disk and zram overheads.
The compression ratio depends entirely on the compression algorithm chosen from /proc/crypto: zlib and zstd are far more effective but far slower. It also depends heavily on the input: with text, zlib and zstd can achieve more than double what lzo and lz4 achieve.
If the input is already compressed, any algorithm might achieve little to no compression, and without a mem_limit zram could grab much precious memory from the system.
mem_limit is the maximum you are prepared to let zram take from the system, and a disksize larger than mem_limit multiplied by the expected compression ratio is likely a waste.
The excess will never get used, but it still counts towards the 0.1% empty creation overhead.
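The sizing rule above can be written as a short calculation. This is only an illustrative sketch: the 512 MiB mem_limit and the compression ratio of 3 are hypothetical values chosen for the example, the helper names are made up, and the 0.1% figure is the one quoted from the kernel documentation above.

```python
MIB = 2**20
GIB = 2**30

def suggested_disksize(mem_limit_bytes, expected_ratio):
    """Largest disksize that can still be filled: mem_limit times the expected compression ratio."""
    return int(mem_limit_bytes * expected_ratio)

def idle_overhead(disksize_bytes, fraction=0.001):
    """Approximate memory used by an empty zram device (about 0.1% of disksize)."""
    return disksize_bytes * fraction

mem_limit = 512 * MIB    # HYPOTHETICAL: what we allow zram to take from the system
expected_ratio = 3.0     # HYPOTHETICAL: depends on the algorithm and on the data

print(f"suggested disksize: {suggested_disksize(mem_limit, expected_ratio) / GIB:.2f} GiB")
print(f"idle overhead of a 100 GiB disksize: {idle_overhead(100 * GIB) / MIB:.1f} MiB")
```

With these example values a disksize of about 1.5 GiB is already enough, whereas a 100 GiB disksize would cost on the order of 100 MiB just for existing, which is significant on a device with 2GB of RAM.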
Maybe try https://github.com/StuartIanNaylor/zram-config
I am very new to OpenCL and trying my first program. I implemented a simple sinc filtering of waveforms. The code works, however I have two questions:
1. Once I increase the size of the input matrix (numrows needs to go up to 100 000) I get (clEnqueueReadBuffer failed: OUT_OF_RESOURCES) even though the matrix is relatively small (a few MB). I think this is to some extent related to the work group size, but could someone elaborate on how I could fix this issue?
Could it be a driver issue?
UPDATE:
Leaving the group size as None crashes.
Adjusting the group size for the GPU (1,600) and the Intel HD (1,50) lets me go up to some 6400 rows. However, for larger sizes it crashes on the GPU, and the Intel HD just freezes and does nothing (0% on the resource monitor).
2. I have an Intel HD4600 and an Nvidia K1100M GPU available, however the Intel is ~2 times faster. I understand this is partially due to the fact that on the Intel I don't need to copy my arrays to separate device memory, unlike on my external GPU. However, I expected a marginal difference. Is this normal, or should my code be better optimized for the GPU? (resolved)
Thanks for your help !!
from __future__ import absolute_import, print_function
import numpy as np
import pyopencl as cl
import os
os.environ['PYOPENCL_COMPILER_OUTPUT'] = '1'
import matplotlib.pyplot as plt

def resample_opencl(y,key='GPU'):
    #
    # selecting to run on GPU or CPU
    #
    newlen = 1200
    my_platform = cl.get_platforms()[0]
    device =my_platform.get_devices()[0]
    for found_platform in cl.get_platforms():
        if (key == 'GPU') and (found_platform.name == 'NVIDIA CUDA'):
            my_platform = found_platform
            device =my_platform.get_devices()[0]
            print("using GPU")
    #
    #Create context for GPU/CPU
    #
    ctx = cl.Context([device])
    #
    # Create queue for each kernel execution
    #
    queue = cl.CommandQueue(ctx,properties=cl.command_queue_properties.PROFILING_ENABLE)
    # queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, """
    __kernel void resample(
        int M,
        __global const float *y_g,
        __global float *res_g)
    {
        int row = get_global_id(0);
        int col = get_global_id(1);
        int gs = get_global_size(1);
        __private float tmp,tmp2,x;
        __private float t;
        t = (float)(col)/2+1;
        tmp=0;
        tmp2=0;
        for (int i=0; i<M ; i++)
        {
            x = (float)(i+1);
            tmp2 = (t- x)*3.14159;
            if (t == x) {
                tmp += y_g[row*M + i] ;
            }
            else
                tmp += y_g[row*M +i] * sin(tmp2)/tmp2;
        }
        res_g[row*gs + col] = tmp;
    }
    """).build()
    mf = cl.mem_flags
    y_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y)
    res = np.zeros((np.shape(y)[0],newlen)).astype(np.float32)
    res_g = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)
    M = np.array(600).astype(np.int32)
    prg.resample(queue, res.shape, (1,200),M, y_g, res_g)
    event = cl.enqueue_copy(queue, res, res_g)
    print("success")
    event.wait()
    return res,event

if __name__ == "__main__":
    #
    # this is the number i need to increase ( up to some 100 000)
    numrows = 2000
    Gaussian = lambda t : 10 * np.exp(-(t - 50)**2 / (2. * 2**2))
    x = np.linspace(1, 101, 600, endpoint=False).astype(np.float32)
    t = np.linspace(1, 101, 1200, endpoint=False).astype(np.float32)
    y= np.zeros(( numrows,np.size(x)))
    y[:] = Gaussian(x).astype(np.float32)
    y = y.astype(np.float32)
    res,event = resample_opencl(y,'GPU')
    print ("OpenCl GPU profiler",(event.profile.end-event.profile.start)*1e-9)
    #
    # test plot if it worked
    #
    plt.figure()
    plt.plot(x,y[1,:],'+')
    plt.plot(t,res[1,:])
Re 1.
Your newlen has to be divisible by 200, because that is what you set as the local dimensions (1,200). I increased it to 9600 and that still worked fine.
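The divisibility requirement can be checked on the host before the kernel is launched. Below is a minimal plain-Python sketch (no pyopencl needed). The helper names is_valid_ndrange and round_up_global_size are made up for illustration, and the rule checked is the OpenCL 1.x one, where every global dimension must be a multiple of the matching local dimension.

```python
def is_valid_ndrange(global_size, local_size):
    """True if every global dimension is a multiple of the matching local dimension."""
    return all(g % l == 0 for g, l in zip(global_size, local_size))

def round_up_global_size(global_size, local_size):
    """Round each global dimension up to the next multiple of the local dimension."""
    return tuple(-(-g // l) * l for g, l in zip(global_size, local_size))

# Shapes from the question: res.shape = (numrows, newlen), local size (1, 200).
print(is_valid_ndrange((2000, 1200), (1, 200)))      # True
print(is_valid_ndrange((2000, 1000), (1, 600)))      # False
print(round_up_global_size((2000, 1000), (1, 600)))  # (2000, 1200)
```

If the global size is padded this way, the kernel would also need a bounds check so that the extra work items do not write past the end of the output buffer.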
Update
After your update, I would suggest not specifying the local dimensions and letting the implementation decide:
prg.resample(queue, res.shape, None,M, y_g, res_g)
It may also improve performance if newlen and numrows were multiples of 16.
It is not a rule that an Nvidia GPU must perform better than an Intel GPU, especially since, according to Wikipedia, there is not a big difference in GFLOPS between them (549.89 vs 288–432). This GFLOPS comparison should be taken with a grain of salt, as one algorithm may be more suitable for one GPU than for the other. In other words, looking at these numbers you may expect one GPU to be typically faster than the other, but that may vary from algorithm to algorithm.
Kernel for 100000 rows requires:
y_g: 100000 * 600 * 4 = 240000000 bytes =~ 228.9 MB
res_g: 100000 * 1200 * 4 = 480000000 bytes =~ 457.8 MB
Quadro K1100M has 2GB of global memory and that should be sufficient for processing 100000 rows. Intel HD 4600 from what I found is limited by memory in the system so I suspect that shouldn't be a problem too.
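The same estimate can be parameterised so that it is easy to redo for other row counts. This is a small sketch of the arithmetic only: the helper name buffer_bytes is made up, float32 data is assumed (4 bytes per element, as in the question), and the 2 GB figure is the device memory quoted above.

```python
MIB = 2**20

def buffer_bytes(rows, cols, itemsize=4):
    """Size in bytes of a dense rows x cols buffer of float32 values."""
    return rows * cols * itemsize

numrows = 100_000
y_bytes = buffer_bytes(numrows, 600)      # input buffer y_g
res_bytes = buffer_bytes(numrows, 1200)   # output buffer res_g
total_bytes = y_bytes + res_bytes

device_bytes = 2 * 2**30                  # 2 GB of global memory on the K1100M

print(f"y_g:   {y_bytes / MIB:.1f} MiB")
print(f"res_g: {res_bytes / MIB:.1f} MiB")
print(f"total: {total_bytes / MIB:.1f} MiB "
      f"({total_bytes / device_bytes:.0%} of device memory)")
```

Both buffers together stay well below the available global memory, which supports the point that buffer size alone is unlikely to explain the failure.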
Re 2.
The time is not measured correctly. Instead of the kernel execution time, the time taken to copy the data back to the host is being measured, so it is no surprise that this number is lower for the CPU. To measure the kernel execution time, do:
event = prg.resample(queue, res.shape, (1,200),M, y_g, res_g)
event.wait()
print ("OpenCl GPU profiler",(event.profile.end-event.profile.start)*1e-9)
I don't know how to measure the whole thing, including copying the data back to the host, using OpenCL profiling events in pyopencl, but plain Python gives similar results:
import time

start = time.time()
...  # code to be measured
end = time.time()
print(end - start)
I think I figured out the issue:
Intel HD: turning off profiling fixes everything. I can run the code without any issues.
K1100M GPU: still crashes, but I suspect this might be a timeout issue, as I am using the same video card for my display.
I am trying to use this package on GitHub for string matching. My dictionary is 4 MB. When creating the Trie, I got fatal error: runtime: out of memory. I am using Ubuntu 14.04 with 8 GB of RAM and Go version 1.4.2.
It seems the error comes from line 99 (now) here: m.trie = make([]node, max)
The program stops at this line.
This is the error:
fatal error: runtime: out of memory
runtime stack:
runtime.SysMap(0xc209cd0000, 0x3b1bc0000, 0x570a00, 0x5783f8)
/usr/local/go/src/runtime/mem_linux.c:149 +0x98
runtime.MHeap_SysAlloc(0x57dae0, 0x3b1bc0000, 0x4296f2)
/usr/local/go/src/runtime/malloc.c:284 +0x124
runtime.MHeap_Alloc(0x57dae0, 0x1d8dda, 0x10100000000, 0x8)
/usr/local/go/src/runtime/mheap.c:240 +0x66
goroutine 1 [running]:
runtime.switchtoM()
/usr/local/go/src/runtime/asm_amd64.s:198 fp=0xc208518a60 sp=0xc208518a58
runtime.mallocgc(0x3b1bb25f0, 0x4d7fc0, 0x0, 0xc20803c0d0)
/usr/local/go/src/runtime/malloc.go:199 +0x9f3 fp=0xc208518b10 sp=0xc208518a60
runtime.newarray(0x4d7fc0, 0x3a164e, 0x1)
/usr/local/go/src/runtime/malloc.go:365 +0xc1 fp=0xc208518b48 sp=0xc208518b10
runtime.makeslice(0x4a52a0, 0x3a164e, 0x3a164e, 0x0, 0x0, 0x0)
/usr/local/go/src/runtime/slice.go:32 +0x15c fp=0xc208518b90 sp=0xc208518b48
github.com/mf/ahocorasick.(*Matcher).buildTrie(0xc2083c7e60, 0xc209860000, 0x26afb, 0x2f555)
/home/go/ahocorasick/ahocorasick.go:104 +0x28b fp=0xc208518d90 sp=0xc208518b90
github.com/mf/ahocorasick.NewStringMatcher(0xc208bd0000, 0x26afb, 0x2d600, 0x8)
/home/go/ahocorasick/ahocorasick.go:222 +0x34b fp=0xc208518ec0 sp=0xc208518d90
main.main()
/home/go/seme/substrings.go:66 +0x257 fp=0xc208518f98 sp=0xc208518ec0
runtime.main()
/usr/local/go/src/runtime/proc.go:63 +0xf3 fp=0xc208518fe0 sp=0xc208518f98
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2232 +0x1 fp=0xc208518fe8 sp=0xc208518fe0
exit status 2
This is the content of the main function (taken from the same repo: test file)
var dictionary = InitDictionary()
var bytes = []byte("Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney) says they will pay this on the usual credit terms (30 days).")
var precomputed = ahocorasick.NewStringMatcher(dictionary)// line 66 here
fmt.Println(precomputed.Match(bytes))
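Your structure is awfully inefficient in terms of memory; let's look at the internals. But before that, a quick reminder of the space required for some Go types on a 32-bit platform (on a 64-bit platform such as amd64, int, uintptr and pointers take 8 bytes and a slice header takes 24, so the figures below roughly double):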
bool: 1 byte
int: 4 bytes
uintptr: 4 bytes
[N]type: N*sizeof(type)
[]type: 12 + len(slice)*sizeof(type)
Now, let's have a look at your structure:
type node struct {
    root    bool       // 1 byte
    b       []byte     // 12 + len(slice)*1
    output  bool       // 1 byte
    index   int        // 4 bytes
    counter int        // 4 bytes
    child   [256]*node // 256*4 = 1024 bytes
    fails   [256]*node // 256*4 = 1024 bytes
    suffix  *node      // 4 bytes
    fail    *node      // 4 bytes
}
OK, you can probably guess what happens here: each node weighs more than 2 KB, which is huge! Finally, let's look at the code that initializes your trie:
func (m *Matcher) buildTrie(dictionary [][]byte) {
    max := 1
    for _, blice := range dictionary {
        max += len(blice)
    }
    m.trie = make([]node, max)
    // ...
}
You said your dictionary is 4 MB. If it is 4 MB in total, then at the end of the for loop max is about 4M (one node per byte of the dictionary). If it instead holds 4M different words, then max = 4M * avg(word_length).
We'll take the first scenario, the nicest one. You are initializing a slice of 4M nodes, each of which uses 2 KB. Yup, that makes a nice 8 GB necessary.
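The stack trace in the question allows a cross-check of this estimate. The sketch below decodes two hexadecimal arguments from the trace. Reading them as the slice length (second argument of runtime.makeslice) and the allocation size in bytes (first argument of runtime.mallocgc) is an assumption about how Go 1.4 prints these frames, although the fact that the division comes out exact supports it. The layout sum uses 64-bit sizes, since the trace comes from an amd64 build.

```python
# Values copied from the stack trace in the question.
num_nodes = 0x3a164e        # ASSUMPTION: len argument of runtime.makeslice
alloc_bytes = 0x3b1bb25f0   # ASSUMPTION: size argument of runtime.mallocgc

node_size = alloc_bytes // num_nodes
alloc_gib = alloc_bytes / 2**30

# Expected size of the node struct on amd64, with fields padded to 8-byte alignment:
# root (1 -> 8), b slice header (24), output (1 -> 8), index (8), counter (8),
# child and fails (256 pointers of 8 bytes each), suffix (8), fail (8).
layout_amd64 = 8 + 24 + 8 + 8 + 8 + 256 * 8 + 256 * 8 + 8 + 8

print(f"nodes requested: {num_nodes:,}")
print(f"bytes per node:  {node_size} (layout estimate: {layout_amd64})")
print(f"allocation size: {alloc_gib:.2f} GiB")
```

Read this way, the program asked for about 3.8 million nodes of roughly 4 KB each, close to 15 GiB in a single allocation, which cannot succeed on a machine with 8 GB of RAM.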
You should review how you build your trie. According to the Wikipedia page on the Aho-Corasick algorithm, each node contains one character, so there are at most 256 characters that go from the root, not 4M.
Some material to make it right: https://web.archive.org/web/20160315124629/http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf
The node type has a memory size of 2084 bytes.
I wrote a little program to demonstrate the memory usage: https://play.golang.org/p/szm7AirsDB
As you can see, the three strings (11(+1) bytes in size) dictionary := []string{"fizz", "buzz", "123"} require 24 MB of memory.
If your dictionary has a length of 4 MB, you would need about 4 * 1024 * 1024 * 2084 bytes = 8.1 GiB of memory.
So you should try to decrease the size of your dictionary.
Setting the resource limit to unlimited worked for me.
If ulimit -a returns 0, run ulimit -c unlimited.
You may want to set a real size limit instead, to be more secure.