Should I disable NUMA if my application is not NUMA-Aware

Running Windows Server 2008 R2 SP1. The application I'm running was not designed with NUMA in mind. Would it be better to disable NUMA on my dual-socket system? My guess is yes, but I wanted to confirm. My server is a Westmere dual-socket system.

If your application is not multithreaded, or is multithreaded but does not use its threads to work simultaneously on the same problem (e.g. it is not data parallel), then you can simply bind the program to one of the NUMA nodes. This can be done with various tools, e.g. with the "Set Affinity..." context menu command in Windows Task Manager. If your program is parallel, then you can still use half of the available processor cores and bind to one NUMA node.
Note that remote memory accesses on Westmere systems are not that expensive: the latency is 1.6x higher than for local access and the bandwidth is almost the same as the local one, so if you do a lot of processing on each memory value the impact would be minimal. On the other hand, disabling NUMA on such systems results in fine-mesh interleaving of both NUMA domains, which makes all applications perform equally badly, as roughly 50% of all memory accesses will be local and 50% will be remote.
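To put a rough number on the interleaving trade-off, here is a back-of-the-envelope sketch. It uses only the 1.6x remote latency figure quoted above; the function name and the choice to normalize local latency to 1.0 are illustrative assumptions, and the model ignores bandwidth and caching entirely.

```python
def mean_latency(local_fraction, remote_penalty=1.6):
    """Average memory latency relative to a purely local access (= 1.0).

    remote_penalty=1.6 is the Westmere figure quoted in the answer above;
    everything else here is a simplifying assumption.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * 1.0 + (1.0 - local_fraction) * remote_penalty


# NUMA disabled (interleaved): about half of the accesses are remote.
interleaved = mean_latency(0.5)
# Process and its memory kept on one node: all accesses are local.
bound = mean_latency(1.0)
```

Under these assumptions the interleaved case averages about 1.3x the local latency, which is why binding a non-NUMA-aware program to one node is usually the more attractive option.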

If I understand correctly, turning NUMA on cannot harm the performance.
If your application is not NUMA aware, accesses will be managed by the OS, so might be across NUMA nodes or might be on the same one - depending on what other pressures the OS has, how much memory / CPU you're using, etc. The OS will try to get your data fast.
If you have it turned off, the OS doesn't know enough to even try to keep each application's execution CPU close to its memory.

Related

Why does Linux distribute threads among NUMA nodes almost equally?

I'm running an application with multiple threads and it seems Linux is distributing the threads among NUMA nodes almost equally. Say my application spawns 4 threads and my machine has 4 sockets. I observe that each thread is assigned to a different NUMA node, so the threads are spread across all nodes almost equally.
Is there any reason for this? Why not assign them all to one socket and then fill the next one?

The best binding for an application depends on what the application does. It is often a good idea to spread threads across different NUMA nodes to maximize memory throughput, since all NUMA nodes can theoretically be used in this case (assuming the application is well written and NUMA aware). If all threads are bound to the same NUMA node, then only the memory of that node can be accessed efficiently (access to the memory of other NUMA nodes is possible but slower, and pages will not automatically be mapped efficiently because of the first-touch policy, which is generally the default on most machines). When some threads communicate a lot, it is often better to put them on the same NUMA node to avoid paying latency overheads. In some cases, it can even be better to put them on the same core (but on different hardware threads) to speed up synchronization operations like locks and atomics.
If you want the scheduling and the binding to be efficient, you need to provide more information to the OS or do it yourself. I strongly advise you to bind threads to specific cores. This is easy with HPC runtimes/tools like OpenMP (but a pain if your application uses low-level threads, unless you do not care about platform portability). As for NUMA, you can specify the policy using numactl. More information is provided in this answer.
In practice, HPC applications generally use manual binding to improve performance. OS schedulers are generally not very good at binding threads automatically and efficiently. A few years ago, there were even bugs in the scheduler causing inefficient behaviour: see The Linux Scheduler: a Decade of Wasted Cores. To my knowledge, such problems are not uncommon in this field and are not restricted to Linux. Efficient NUMA-aware OS scheduling is far from easy.
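As a minimal sketch of the two placement strategies discussed here (spreading across nodes versus filling one node first), the following uses hypothetical helper names and node numbering. The only real API call is os.sched_setaffinity, which is available on Linux only; it is defined but not invoked, since the right CPU ids depend on your machine.

```python
import os


def spread(n_threads, nodes):
    """Round-robin placement: thread i goes to node i mod len(nodes)."""
    return [nodes[i % len(nodes)] for i in range(n_threads)]


def pack(n_threads, nodes, cores_per_node):
    """Fill one node before moving to the next (overflow stays on the last node)."""
    last = len(nodes) - 1
    return [nodes[min(i // cores_per_node, last)] for i in range(n_threads)]


def pin_to_cpus(cpus):
    """Restrict the caller to the given CPU ids; does nothing where unsupported."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, set(cpus))  # 0 means "the caller"


# 4 threads on a 4-node machine, assuming 6 cores per node (an illustrative value).
spread_plan = spread(4, [0, 1, 2, 3])
packed_plan = pack(4, [0, 1, 2, 3], cores_per_node=6)
```

The spread plan favours aggregate memory bandwidth, while the packed plan favours threads that communicate heavily; which one wins has to be measured for the application at hand.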

Ubuntu Firebird2 Classic Server config to use all CPU

I am using Firebird 2.0.7 CS on my Ubuntu 16.04 Server. It is not possible to upgrade to a higher version due to the software used, which requires a lower one.
I've used the SuperServer version before, but on Linux the parameter CpuAffinityMask is ignored.
The SuperServer version performs terribly, because on Linux it uses only 1 core.
The ClassicServer version is a little better, because it assigns 1 core to each user.
When I run a demanding task in the program, fb_inet_server uses 100% of 1 core, but the other 23 cores are idle. How can I assign more cores to this process?

The CpuAffinityMask setting is only for SuperServer (and then only for Windows).
If you are using Classic Server, then Firebird can (and it will) use all cores if there is sufficient activity, however the Firebird processes need to coordinate their effort, which - if there is a lot of lock contention - can lead to reduced performance.
To reduce lock contention, you may want to increase the LockHashSlots setting.
Increasing the number of page buffers may also help, but keep in mind that with Classic Server, this setting is per process and can increase memory usage.
Contrary to what you state, Firebird does not "assign[s] 1 core to the 1 user.". Classic Server will create a process per connection, and the threads of these processes will be scheduled by the OS on any available core.

Do we need to disable swap for Riak?

I just found in the Riak documentation that swap can make the server unresponsive, so it has to be disabled. It also says that the Riak node should be allowed to be killed by the kernel if it uses too much RAM, and that if swap is completely disabled, Riak will simply exit. I am confused: should we disable swap or not?

http://docs.basho.com/riak/latest/cookbooks/Linux-Performance-Tuning/
Swap Space
Due to the heavily I/O-focused profile of Riak, swap usage
can result in the entire server becoming unresponsive. Disable swap or
otherwise implement a solution for ensuring Riak's process pages are
not swapped.
Basho recommends that the Riak node be allowed to be killed by the
kernel if it uses too much RAM. If swap is completely disabled, Riak
will simply exit when it is unable to allocate more RAM and leave a
crash dump (named erl_crash.dump) in the /var/log/riak directory which
can be used for forensics (by Basho Client Services Engineers if you
are a customer).
So no, you don't have to ... but if you don't and you use all your available RAM the machine is likely to become unresponsive.
With any (unbounded) application that performs heavy I/O where you could exhaust your system's memory that's going to be the case. Typically you would have monitoring on the machine that warned you of memory usage going past a threshold.
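A threshold check of the kind mentioned above could look roughly like this. It is a sketch only: the function names, the 90% threshold and the sample text are assumptions, and it relies on the MemTotal and MemAvailable fields of /proc/meminfo (MemAvailable is only present on reasonably recent Linux kernels).

```python
def memory_used_fraction(meminfo_text):
    """Return the fraction of RAM in use, given /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # memory values are in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return (total - available) / total


def should_warn(meminfo_text, threshold=0.9):
    """True once memory usage reaches the (assumed) warning threshold."""
    return memory_used_fraction(meminfo_text) >= threshold


# Made-up sample in the /proc/meminfo format, for illustration only.
SAMPLE = (
    "MemTotal:       16000000 kB\n"
    "MemFree:         1000000 kB\n"
    "MemAvailable:    2000000 kB\n"
)

# On a real Linux host you would read the live values instead:
#     with open("/proc/meminfo") as f:
#         print(should_warn(f.read()))
```

In practice this job is normally handed to an existing monitoring system; the point is only that the alert should fire well before the node runs out of RAM.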

Making an Operating System Give More Memory Access to Erlang

I have always run Erlang applications on powerful servers. However, sometimes you cannot avoid memory errors like the following, especially when there are many users:
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 467078560 bytes of memory (of type "heap").
What makes it more annoying is that this is a server with 20GB of RAM and, say, 8 cores. The amount of memory that Erlang says it could not allocate, and which caused the crash, is also disturbing, because it is very little compared to what the server has available.
My question today (I hope it is not closed) is: what operating system configuration can be done (consider Red Hat, Solaris, Ubuntu or Linux in general) to make the OS offer the Erlang VM more memory when it needs it? If one is to run an Erlang application on such capable servers, what memory considerations (outside Erlang) should be made as regards the underlying operating system?
Problem Background
Erlang consumes main memory, especially when processes number in the thousands. I am running a web service using the Yaws web server. On the same node, I have Mnesia running with about 3 ram_copies tables. It is a notification system, part of a larger web application running on an intranet. Users access this system via JSONP from the main application, which runs on a different web server and different hardware. Each user connection queries Mnesia directly for any data it needs. However, as the number of users increases, I always get the crash dump.
I have tweaked the application itself as much as possible: cleaned up the code, used binaries rather than strings, and avoided single points such as gen_servers between Yaws processes and Mnesia, so that each connection just hits Mnesia directly. The server is very capable, with lots of RAM and disk space. However, my node crashes when it needs a little more memory, which is why I need to find a way of forcing the operating system to give more memory to Erlang. The operating system is Red Hat Enterprise 6.

It is probably because you are running in 32bit mode where only approx 4 GB of RAM is addressable. Try switching to the 64bit version of erlang and try again.
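To see why the 32-bit explanation is plausible, it helps to put the failed request next to the address-space limit. The arithmetic below is only a sketch: the byte count comes from the eheap_alloc message in the question, and the helper name is hypothetical. On the Erlang side, evaluating erlang:system_info(wordsize) in the shell should return 4 on a 32-bit VM and 8 on a 64-bit one.

```python
def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / 2**20


failed_request = 467078560      # bytes, from the eheap_alloc error above
address_space_32bit = 2**32     # 4 GiB: upper bound for any 32-bit process

request_mib = to_mib(failed_request)
share_of_address_space = failed_request / address_space_32bit
```

The single request is roughly 445 MiB, or about 11% of everything a 32-bit process can address. The heap block must be contiguous, so it can fail even though the machine as a whole still has plenty of free RAM.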

Several server tutorials I have read say that if the service runs as a non-root user, you may have to edit /etc/security/limits.conf to raise the limits that apply to that user. The example below sets the memlock limit for the user fooservice to 2GB (the value is given in KB).
fooservice hard memlock 2097152
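As a quick sanity check on that value, the sketch below converts the limits.conf figure (which is in KB) and reads the limit currently in effect through Python's resource module, which is Unix-only. The helper names are hypothetical. Note that memlock caps how much memory a process may lock into RAM, so whether it is the limit that matters for your service is something to verify rather than assume.

```python
import resource


def kib_to_gib(kib):
    """limits.conf memory values are in KB; convert to GiB."""
    return kib / 2**20


def describe(limit_bytes):
    """Render a value from getrlimit (reported in bytes) in readable form."""
    if limit_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit_bytes / 2**30:.2f} GiB"


configured_gib = kib_to_gib(2097152)   # the value from the limits.conf line above

# Soft and hard memlock limits of the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
current = (describe(soft), describe(hard))
```

Running this as the service user shows whether the new limit has actually been picked up, since limits.conf changes typically only apply to sessions started after the edit.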

Physical Cores vs Virtual Cores in Parallelism

When it comes to virtualization, I have been deliberating on the relationship between the physical cores and the virtual cores, especially in how it affects applications employing parallelism. For example, in a VM scenario, if there are fewer physical cores than there are virtual cores (if that's possible), what effects or limits are placed on the application's parallel processing? I'm asking because in my environment the physical architecture is not disclosed. Is there still much advantage to parallelizing if the application lives on a dual core VM hosted on a single core physical machine?

Is there still much advantage to parallelizing if the application lives on a dual core VM hosted on a single core physical machine?
Always.
The OS-level parallel processing (i.e., Linux pipelines) will improve performance dramatically, irrespective of how many cores -- real or virtual -- you have.
Indeed, you have to create fairly contrived problems or really dumb solutions to not see performance improvements from simply breaking a big problem into lots of smaller problems along a pipeline.
Once you've got a pipelined solution, and it ties up 100% of your virtual resources, you have something you can measure.
Start trying different variations on logical and physical resources.
But only after you have an OS-level pipeline that uses up every available resource. Until then, you've got fundamental work to do just creating a pipeline solution.

Since you included the F# tag, and you're interested in parallel performance, I'll assume that you're using F# asynchronous IO, hence threads never block, they just swap between CPU bound tasks.
In this case it's ideal to have the same number of threads as the number of virtual cores (at least based on my experiments with F# in Ubuntu under VirtualBox hosted by Windows 7). Having more threads than that decreases performance slightly; having fewer decreases performance quite a bit.
Also, having more virtual cores than physical cores decreases performance a little. But if this is something you can't control, just make sure you have an active worker thread for each virtual core.
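The "one worker per virtual core" rule can be sketched as follows. This is an illustration in Python rather than F#, with hypothetical function names; os.cpu_count() reports the cores visible to the guest OS, which inside a VM are the virtual ones. In CPython, threads do not execute CPU-bound bytecode in parallel, so treat this purely as a demonstration of how to size the pool.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def worker_count():
    """One worker per core the OS reports (virtual cores inside a VM)."""
    return os.cpu_count() or 1


def run_all(tasks):
    """Run zero-argument callables on a pool sized to the visible cores."""
    with ThreadPoolExecutor(max_workers=worker_count()) as pool:
        return list(pool.map(lambda task: task(), tasks))


results = run_all([lambda: 1 + 1, lambda: 2 * 3, lambda: 10 - 4])
```

Because the guest cannot see the physical core count, sizing to the visible cores is the only option available to the application; the hypervisor decides how those virtual cores map onto real ones.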
