Minimising ArangoDB memory use - arangodb

I have a number of different deployment situations for ArangoDB. One of these is on a user's desktop machine or laptop.
I've read and implemented the instructions on how to run ArangoDB in spartan mode (very helpful).
However, I need more. The desktop user may work with a number of different collections in the database, and all of these stay loaded and consume a lot of virtual memory. This can cause some apps to behave differently if they detect they are running in a memory-constrained environment.
So, I'm looking for a way to unload collections that haven't been accessed "recently" (some configurable amount of time).
Is there a (good) way to go about doing this?
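To make it concrete, here is the sort of thing I have in mind, as a rough, untested sketch. It assumes the MMFiles engine (where PUT /_api/collection/<name>/unload is available in 3.4), that authentication is handled separately, and that my own application records when each collection was last queried:

# Sketch: unload collections my application has not touched recently.
# Assumes ArangoDB 3.4 / MMFiles, a server on localhost:8529, and that
# last_access is maintained by the application on every query.
import json
import time
import urllib.request

ARANGO = "http://localhost:8529/_db/_system"   # assumed endpoint, auth omitted
IDLE_SECONDS = 15 * 60                         # configurable "recently" threshold

# filled in by the application whenever it queries a collection
last_access = {"customers": time.time() - 3600, "orders": time.time()}

def unload_idle_collections():
    now = time.time()
    for name, accessed in last_access.items():
        if now - accessed > IDLE_SECONDS:
            req = urllib.request.Request(
                ARANGO + "/_api/collection/" + name + "/unload", method="PUT")
            with urllib.request.urlopen(req) as resp:
                print(name, "status:", json.load(resp).get("status"))

unload_idle_collections()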

For version 3.4, I added the following parameters to arangod.conf to start it in so-called spartan mode.
More details can be found in their blog post.
[javascript]
# number of V8 contexts available for JavaScript execution. use 0 to
# make arangod determine the number of contexts automatically.
v8-contexts = 1
[foxx]
# enable Foxx queues in the server
# Disable task scheduling - reduce CPU
queues = false
[wal]
# Reduce the number of historic WAL files which will reduce the memory usage when ArangoDB is in use.
historic-logfiles = 1
# Reduce the prepared WAL log files which are kept ready for future write operations
reserve-logfiles = 1
# In addition you can reduce the size of all WAL files to e.g. 8 MB by setting
logfile-size = 8388608

Related

What to expect in terms of performance from my Spark Streaming Application in local mode?

I realize this might be a very broad question, but this is my issue: I developed a Spark Application in Java which uses an algorithm to analyse several JSON messages (1kB of size each) which are received through a socket connection, in one second intervals.
I'm only using 6 map methods, but the functions inside have several loops that can run up to 1000 times each (there are even cases where I have a loop inside a loop which leads to them being run 1000*1000 times in total).
I'm running the application in local mode, that is, with just one node (I assume) to perform the Spark tasks and jobs.
The problem here is that I am taking up to 7 minutes to process one of these messages, which is an insane amount of time, and causes great scheduling delays.
Is this normal given the complexity of my algorithm + running in local mode + possibly some memory leakage?
If so, how can I proceed to improve the throughput?
Don't know if it helps, but here are some specifications of my computer:
Processor: Intel Core i5, 2.60GHz
RAM: 3.87GB usable memory
64 bit operating system
Thank you so much.
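Edit: for reference, this is roughly the first change I am planning to try, shown as a PySpark sketch rather than my actual Java code (untested; the host/port and the analyse stub are placeholders): run with all local cores instead of a single-threaded local master, and repartition each batch so the expensive map functions can use them.

# Sketch: use every local core and spread each 1-second batch across
# partitions so the heavy per-message work runs in parallel.
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

def analyse(message):
    # placeholder for the real algorithm with the nested loops
    return len(message)

conf = SparkConf().setMaster("local[*]").setAppName("json-analyser")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)                    # 1-second batches, as in the question

lines = ssc.socketTextStream("localhost", 9999)  # placeholder socket source
results = lines.repartition(sc.defaultParallelism).map(analyse)
results.pprint()

ssc.start()
ssc.awaitTermination()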

What is connection_cache_size in zodbconn.uri (Pyramid framework ZODB scaffold)?

In the Pyramid framework (ZODB scaffold project package) there is a line in development.ini:
zodbconn.uri = file://%(here)s/Data.fs?connection_cache_size=20000
When pserve development.ini is invoked, Data.fs, Data.fs.index, Data.fs.lock and Data.fs.tmp are created.
I understand that zodbconn.uri creates the ZODB database on disk, but what is connection_cache_size? The default value in development.ini is 20000. What should its value be, i.e., what is the value based on?
The parameter configures how many objects the connection caches in RAM. Note that this is an object count, not a size in bytes.
Keeping a number of objects in memory reduces load times, but increases RAM use. What you set the number to then depends on how much RAM you have and how big your objects are.
You may want to monitor your ZODB activity to see how many loads are performed over a given time period, see how much RAM is used at the same time, and adjust the cache size accordingly. Nagios or a similar monitoring / graphing system is ideal there.
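For illustration only (not part of the scaffold), the same knob is available when opening the database directly from Python; the 20000 figure and the Data.fs path below just mirror the development.ini defaults:

# Sketch: open the same Data.fs and set the per-connection object cache
# to 20000 objects (an object count, not a size in bytes).
import transaction
from ZODB import DB
from ZODB.FileStorage import FileStorage

db = DB(FileStorage("Data.fs"), cache_size=20000)  # same as connection_cache_size

conn = db.open()
root = conn.root()
root["greeting"] = "hello"      # any persistent change
transaction.commit()

print(db.cacheSize())           # objects currently held across connection caches
conn.close()
db.close()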

Erlang garbage collection

I need your help investigating an issue with Erlang memory consumption. How typical, isn't it?
We have two different deployment schemes.
In the first scheme we run many identical nodes on small virtual machines (in Amazon AWS), one node per machine. Each machine has 4 GB of RAM.
In the other deployment scheme we run these nodes on big bare-metal machines (with 64 GB of RAM), with many nodes per machine. In this deployment the nodes are isolated in Docker containers (with the memory limit set to 4 GB).
I've noticed that the heaps of processes in dockerized nodes hog up to 3 times more RAM than heaps in non-dockerized nodes under identical load. I suspect that garbage collection in non-dockerized nodes is more aggressive.
Unfortunately, I don't have any garbage collection statistics, but I would like to obtain it ASAP.
To give more information, I should say that we are using HiPE R17.1 on Ubuntu 14.04 with stock kernel. In both schemes we are running 8 schedulers per node, and using default fullsweep_after flag.
My blind guess is that Erlang's default garbage collection somehow relies on /proc/meminfo (which is not accurate in a dockerized environment).
I am not a C guy and not familiar with the emulator internals, so could someone point me to the places in the Erlang sources that are responsible for garbage collection, and to emulator options I can use to tweak this behavior?
Unfortunately, VMs often try to be smarter with memory management than necessary, and that does not always play nicely with the Erlang memory management model. Erlang tends to allocate and release a large number of small chunks of memory, which is very different from typical applications, which usually allocate and release a small number of big chunks of memory.
One of those technologies is Transparent Huge Pages (THP), which some OSes enable by default and which causes Erlang nodes running in such VMs to grow (until they crash).
https://access.redhat.com/solutions/46111
https://www.digitalocean.com/company/blog/transparent-huge-pages-and-alternative-memory-allocators/
https://docs.mongodb.org/manual/tutorial/transparent-huge-pages/
So, ensuring THP is switched off is the first thing you can check.
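A quick way to check the current THP state is to read the sysfs file; here is a small sketch (note the path differs on some distributions, e.g. Red Hat uses redhat_transparent_hugepage):

# Sketch: report whether Transparent Huge Pages are enabled on this host.
THP = "/sys/kernel/mm/transparent_hugepage/enabled"

try:
    with open(THP) as f:
        state = f.read().strip()      # e.g. "[always] madvise never"
    print("THP setting:", state)
    if "[never]" not in state:
        print("THP is not disabled; consider: echo never >", THP)
except FileNotFoundError:
    print("No THP sysfs entry at", THP)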
The other is trying to tweak the memory options used when starting the Erlang VM itself, for example see this post:
Erlang: discrepancy of memory usage figures
Resulting options that worked for us:
-MBas aobf -MBlmbcs 512 -MEas aobf -MElmbcs 512
Some more theory about memory allocators:
http://www.erlang-factory.com/static/upload/media/139454517145429lukaslarsson.pdf
And more detailed description of memory allocator flags:
http://erlang.org/doc/man/erts_alloc.html
The first thing to know is that garbage collection in Erlang is per process. Each process is garbage collected on its own schedule, independently of the others. So garbage collection in your system depends only on the data in your processes, not on the operating system itself.
That said, there can be differences between memory consumption from the Erlang point of view and from the system point of view. That is why comparing erlang:memory() to what your system reports is always a good idea (it can reveal binary leaks or other memory problems).
If you would like to understand a little more about Erlang internals, I would recommend these two talks:
https://www.youtube.com/watch?v=QbzH0L_0pxI
https://www.youtube.com/watch?v=YuPaX11vZyI
And for better debugging of your memory management I would recommend starting with http://ferd.github.io/recon/

Do I need to tune sysctl.conf under linux when running MongoDB?

We are seeing occasional huge writes to disk in the MongoDB log, effectively locking MongoDB for a long time. Many people are reporting similar issues on the net, but I have found no good answers so far.
Tue Mar 11 09:42:49.818 [DataFileSync] flushing mmaps took 75264ms for 46 files
The average mmap flush on my server is around 100 ms according to the mongo statistics.
A large percentage of our MongoDB data is updated within a few hours. This leads me to speculate whether we need to tune the Linux sysctl virtual memory parameters as described in the performance guide for Neo4j, another memory-mapped tool: http://docs.neo4j.org/chunked/stable/linux-performance-guide.html
There are a lot of blocks going out to IO, way more than expected for the write speed we are seeing in the benchmark. Another observation that can be made is that the Linux kernel has spawned a process called "flush-x:x" (run top) that seems to be consuming a lot of resources.
The problem here is that the Linux kernel is trying to be smart and write out dirty pages from the virtual memory. As the benchmark will memory map a 1GB file and do random writes it is likely that this will result in 1/4 of the memory pages available on the system to be marked as dirty. The Neo4j kernel is not sending any system calls to the Linux kernel to write out these pages to disk however the Linux kernel decided to start doing so and it is a very bad decision. The result is that instead of doing sequential like writes down to disk (the logical log file) we are now doing random writes writing regions of the memory mapped file to disk.
top shows that we indeed have a flush process that has been running for a very long time, so this seems to match.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28352 mongod 20 0 153g 3.2g 3.1g S 3.3 42.3 299:18.36 mongod
3678 root 20 0 0 0 0 S 0.3 0.0 26:27.88 flush-253:1
The recommended Neo4J sysctl settings are
vm.dirty_background_ratio = 50
vm.dirty_ratio = 80
Do these settings have any relevance for a MongoDB installation at all?
The short answer is "yes". What values to choose depends very much on your write patterns. This gives background on exactly how MongoDB manages its mappings - it's not anything unexpected.
One wrinkle is that in a web-facing database application, you may care about latency more than throughput. vm.dirty_background_ratio gives the threshold for starting to write dirty pages, and vm.dirty_ratio tells when to stop accepting new writes (ie, block) until all writes have been flushed.
If you are hammering a relatively small working set, you can be OK with setting both of those values fairly high, and relying on Mongo's (or the OS's) periodic time-based flush-to-disk to commit the writes.
If you're conducting a high volume of inserts and also some modifications, which sounds like it might be your situation, it's a balancing act that depends on inserts vs. rewrites - starting to flush too early will cause writes that will be re-written soon, "wasting" io. Starting to flush too late will result in pauses as you flush huge writes.
If you're doing mostly inserts, then you may very well want a large dirty_ratio (to avoid blocking) and a relatively small dirty_background_ratio (small enough to always be writing as you're inserting to reduce latency, and just large enough to linearize some of the writes).
The correct solution is to replay some dummy data with various options for those sysctl parameters, and optimize it by brute force, bearing in mind your average latency / total throughput objectives.
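To illustrate that brute-force approach, here is a minimal sketch of the kind of replay harness you could run once per sysctl combination (assumes a local mongod, the pymongo driver, and made-up document shapes and counts):

# Sketch: time a batch of dummy inserts so different vm.dirty_* settings
# can be compared on average and tail latency.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client.bench.dummy

def run_batch(n=10000):
    coll.drop()
    latencies = []
    for i in range(n):
        start = time.perf_counter()
        coll.insert_one({"seq": i, "payload": "x" * 1024})
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return sum(latencies) / n, latencies[int(n * 0.99)]

avg, p99 = run_batch()
print("avg %.2f ms, p99 %.2f ms" % (avg * 1000, p99 * 1000))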

How to shrink the page table size of a process?

The mongodb server maps all db files into RAM. As the database grows, the server ends up with a huge page table, which can reach 3 GB.
Is there a way to shrink it when the server is running?
mongodb version is 2.0.4
Mongodb will memory-map all of the data files that it creates, plus the journal files (if you're using journaling). There is no way to prevent this from happening. This means that the virtual memory size of the MongoDB process will always be roughly twice the size of the data files.
Note that the OS memory management system will page out unused RAM pages, so that the physical memory size of the process will typically be much less than the virtual memory size.
The only way to reduce the virtual memory size of the 'mongod' process is to reduce the size of the MongoDB data files. The only way to reduce the size of the data files is to take the node offline and perform a 'repair'.
See here for more details:
- http://www.mongodb.org/display/DOCS/Excessive+Disk+Space#ExcessiveDiskSpace-RecoveringDeletedSpace
Basically you are asking to do something that the MongoDB manual recommends against in this specific scenario: http://docs.mongodb.org/manual/administration/ulimit/. Recommended, however, does not mean required; it is really just a guideline.
This is just the way MongoDB runs and something you have got to accept unless you wish to toy around and test out different scenarios and how they work.
You probably want to reduce the memory used by the process. You could use the ulimit bash builtin (before starting your server, perhaps in some /etc/rc.d/mongodb script), which calls the setrlimit(2) syscall.
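For illustration, the same limit can also be applied from Python just before launching the server; RLIMIT_AS caps the virtual address space, which is exactly what the MongoDB ulimit page warns about for memory-mapped deployments, so treat this as an experiment only (a sketch; the 8 GB figure and the config path are arbitrary).

# Sketch: cap the virtual address space of a child mongod process using the
# same setrlimit(2) mechanism behind 'ulimit -v'. Experiment only.
import resource
import subprocess

LIMIT_BYTES = 8 * 1024 ** 3   # 8 GB, arbitrary

def cap_address_space():
    resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))

# preexec_fn runs in the child just before exec, so only mongod is limited.
subprocess.Popen(["mongod", "--config", "/etc/mongod.conf"],
                 preexec_fn=cap_address_space)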
