CPU and memory usage of jemalloc as compared to glibc malloc

I recently learned about jemalloc, the memory allocator used by Firefox. I have tried integrating jemalloc into my system by overriding the new and delete operators and calling the jemalloc equivalents of malloc and free, i.e. je_malloc and je_free. I have written a test application that does 100 million allocations. I have run the application both with glibc malloc and with jemalloc; while running with jemalloc takes less time for these allocations, the CPU utilization is quite high, and the memory footprint is also larger than with glibc malloc. After reading this document on jemalloc analysis,
it seemed that jemalloc might have a larger footprint than malloc, as it employs techniques that optimize for speed rather than memory. However, I haven't found any pointers regarding the CPU usage with jemalloc. I would like to state that I am working on a multiprocessor machine, the details of which are given below.
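For reference, a minimal sketch of the kind of override described above, assuming jemalloc was built with the je_ function prefix (otherwise it replaces malloc/free directly and no code change is needed); the allocation count in main is purely illustrative:

#include <jemalloc/jemalloc.h>  // header location assumes a standard jemalloc install
#include <cstddef>
#include <new>

// Route all C++ heap allocations through jemalloc.
void* operator new(std::size_t size) {
    void* p = je_malloc(size ? size : 1);   // malloc(0) may return NULL, but new must not
    if (p == nullptr) throw std::bad_alloc();
    return p;
}
void operator delete(void* p) noexcept {
    if (p != nullptr) je_free(p);
}
void* operator new[](std::size_t size) { return operator new(size); }
void operator delete[](void* p) noexcept { operator delete(p); }

// Small allocation loop in the spirit of the test described above.
int main() {
    for (long i = 0; i < 100000000; ++i) {
        char* p = new char[64];
        delete[] p;
    }
    return 0;
}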
processor : 11
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5680 @ 3.33GHz
stepping : 2
cpu MHz : 3325.117
cache size : 12288 KB
physical id : 1
siblings : 12
core id : 10
cpu cores : 6
apicid : 53
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc ida nonstop_tsc arat pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 6649.91
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
I am using top -c -b -d 1.10 -p 24670 | awk -v time=$TIME '{print time,",",$9}' to keep track of the CPU usage.
Has anyone had similar experiences while integrating jemalloc?
Thanks!

One wise guy said at CppCon that you never have to guess about performance; you have to measure it instead.
I tried to use jemalloc with a multithreaded Linux application. It was a custom application-level protocol server (over TCP/IP). This C++ application used some Java code via JNI (about 5% of the time it ran Java, and 95% of the time C++ code). I ran 2 application instances in production mode. Each one had 150 threads.
After 72 hours of running, the glibc one used 900 MB of memory, and the jemalloc one used 2.2 GB of memory. I didn't see a significant CPU usage difference. Actual performance (average client request serving time) was about the same for both instances.
So, in my test glibc was much better than jemalloc. Of course, this is specific to my application.
Conclusion: If you have reasons to think that your application's memory management is not effective because of fragmentation, you have to make a test similar to the one I described. It is the only reliable source of information for your specific needs. If jemalloc were always better than glibc, glibc would make jemalloc its official allocator. If glibc were always better, jemalloc would cease to exist. When competitors coexist for a long time, it means that each one has its own usage niche.
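A quick, hedged way to run the kind of A/B test described above without rebuilding is to preload jemalloc into the same binary (the library path is an assumption and varies by distro; your_server is a placeholder):

# baseline run with glibc malloc
/usr/bin/time -v ./your_server
# same binary with jemalloc swapped in at load time
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 /usr/bin/time -v ./your_server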

Aerospike implemented jemalloc in our NoSQL database and publicly released the implementation about a year ago with v3.3.x. Just today Psi Mankoski published an article on High Scalability about why and how we did it, and the performance improvement it gave compared to glibc malloc.
We actually saw a decrease in RAM utilization because of the way we were able to use jemalloc's debugging capability to minimize RAM fragmentation. In the production environment, server % free memory was often a "spiky graph" and had spiked as high as 54% prior to the implementation of jemalloc. After the implementation, you can see the decrease in RAM utilization over the 4-month analysis period. RAM % free memory began to "flatline" and become far more predictable, hovering between roughly 22-40% depending on the server node.
As Preet says, there was a lot less fragmentation over time, which means less RAM utilization. Psi's article gives the "proof in the pudding" behind such a statement.

This question might not belong here since for real-world solutions, it should be irrelevant what other people found on their different hardware/environments/usage scenarios. You should test on the target system and see what suits you.
As for the higher memory footprint, one of the most classic performance optimizations in computer science is the time-memory tradeoff: caching certain results for instant lookup later on, thereby preventing frequent recalculation. Also, since jemalloc is presumably a lot more complex, there is probably a lot more internal bookkeeping. This kind of tradeoff should be more or less expected, especially when picking between variants of such low-level and widely used core modules. You have to match the performance characteristics to your usage characteristics, since usually there is no silver bullet.
You might also want to look at Google's TCMalloc, which is quite close, although I believe jemalloc is slightly more performant in general, as well as creating less heap fragmentation over time.

I am developing a simple NoSQL database.
(https://github.com/nmmmnu/HM4)
jemalloc vs standard malloc
When I use jemalloc, performance decreases, but memory "fragmentation" decreases as well. jemalloc also seems to use less memory at the peak, but the difference is only 5-6%.
What I mean by memory fragmentation is the following:
First I allocate lots of key-value pairs (5-7 GB of memory).
Then I look at the memory usage.
Then I deallocate all pairs and any other memory my executable uses. The order of deallocation is the same as the order of allocation.
Finally I check the memory usage again.
With standard malloc, usage stays almost at the peak level (I specifically checked for mmap'ed memory, and there is none).
With jemalloc, usage is minimal.
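A rough sketch of this check might look as follows (Linux-only, since it reads VmRSS from /proc/self/status; the pair count and value size are illustrative and should be tuned towards a few GB):

#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Current resident set size in kB, read from /proc/self/status (Linux-specific).
static long rss_kb() {
    long kb = -1;
    char line[256];
    if (FILE* f = std::fopen("/proc/self/status", "r")) {
        while (std::fgets(line, sizeof line, f))
            if (std::sscanf(line, "VmRSS: %ld", &kb) == 1) break;
        std::fclose(f);
    }
    return kb;
}

int main() {
    const std::size_t count = 20000000;  // illustrative; tune towards a few GB of key-value pairs
    std::vector<std::pair<std::string, std::string>> pairs;
    pairs.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        pairs.emplace_back("key" + std::to_string(i), std::string(64, 'v'));
    std::printf("peak  RSS: %ld kB\n", rss_kb());

    pairs.clear();           // deallocate all pairs...
    pairs.shrink_to_fit();   // ...and the vector's own buffer
    std::printf("after RSS: %ld kB\n", rss_kb());
    return 0;
}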
Bonus information - tcmalloc
Last time I checked tcmalloc, it was really very fast - probably a 10% improvement over standard malloc.
At the peak, it consumed less memory than standard malloc, but more than jemalloc.
I do not remember the exact memory fragmentation, but it was far from jemalloc's result.

This paper investigates the performance of different memory allocators.
Sharing some conclusions here:
Figure 1 shows the effects of different allocation strategies on TPC-DS with scale factor 100. We measure memory consumption and execution time with our multi-threaded database system on a 4-socket Intel Xeon server. In this experiment, our DBMS executes the query set sequentially using all available cores. Even this relatively simple workload already results in significant performance and memory usage differences. Our database linked with jemalloc can reduce the execution time to 1/2 in comparison to linking it with the standard malloc of glibc 2.23.

Related

Enabling AVX512 support on compilation significantly decreases performance

I've got a C/C++ project that uses a static library. The library is built for 'skylake' architecture. The project is a data processing module, i.e. it performs many arithmetic operations, memory copying, searching, comparing, etc.
The CPU is Xeon Gold 6130T, it supports AVX512. I tried to compile my project with both -march=skylake and -march=skylake-avx512 and then link with the library.
When using -march=skylake-avx512, the project's performance is significantly decreased (by 30% on average) in comparison to the project built with -march=skylake.
How can this be explained? What could be the reason?
Info:
Linux 3.10
gcc 9.2
Intel Xeon Gold 6130T
project performance is significantly decreased (by 30% on average)
In code that cannot be easily vectorized, sporadic AVX instructions here and there downclock your CPU but do not provide any benefit. You may want to turn off AVX instructions completely in such scenarios.
See Advanced Vector Extensions, Downclocking:
Since AVX instructions are wider and generate more heat, Intel processors have provisions to reduce the Turbo Boost frequency limit when such instructions are being executed. The throttling is divided into three levels:
L0 (100%): The normal turbo boost limit.
L1 (~85%): The "AVX boost" limit. Soft-triggered by 256-bit "heavy" (floating-point unit: FP math and integer multiplication) instructions. Hard-triggered by "light" (all other) 512-bit instructions.
L2 (~60%): The "AVX-512 boost" limit. Soft-triggered by 512-bit heavy instructions.
The frequency transition can be soft or hard. Hard transition means the frequency is reduced as soon as such an instruction is spotted; soft transition means that the frequency is reduced only after reaching a threshold number of matching instructions. The limit is per-thread.
Downclocking means that using AVX in a mixed workload with an Intel processor can incur a frequency penalty, despite it being faster in a "pure" context. Avoiding the use of wide and heavy instructions helps minimize the impact in these cases. AVX-512VL is an example of only using 256-bit operands in AVX-512, making it a sensible default for mixed loads.
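One hedged way to try this with GCC (flag behaviour as documented for recent GCC versions; module.cpp is a placeholder) is either to keep the AVX-512 instruction set available but tell the vectorizer to prefer 256-bit vectors, or to drop AVX-512 code generation entirely:

# keep -march=skylake-avx512 but prefer 256-bit vectors (avoids most heavy 512-bit instructions)
g++ -O3 -march=skylake-avx512 -mprefer-vector-width=256 -c module.cpp
# or generate no AVX-512 instructions at all
g++ -O3 -march=skylake -c module.cpp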
Also, see
On the dangers of Intel's frequency scaling.
Gathering Intel on Intel AVX-512 Transitions.
How to Fix Intel?.

Logging all memory accesses of any executable/process in Linux

I have been looking for a way to log all memory accesses of a process/execution in Linux. I know there have been questions asked on this topic previously, like this one:
Logging memory access footprint of whole system in Linux
But I wanted to know if there is any non-instrumentation tool that performs this activity. I am not looking at QEMU/Valgrind for this purpose since they would be a bit slow and I want as little overhead as possible.
I looked at perf mem and PEBS events like cpu/mem-loads/pp for this purpose, but I see that they only collect sampled data, and I actually want a trace of all the memory accesses without any sampling.
I want to know whether there is any possibility of collecting all memory accesses without spending too much on overhead, as with a tool like QEMU. Is there any possibility of using perf alone, but without sampling, so that I get all the memory access data?
Is there any other tool out there that I am missing? Or any other strategy that gives me all the memory access data?
It is simply impossible to have both the fastest possible run of SPEC and all memory accesses (or cache misses) traced in that run (using in-system tracers). Do one run for timing and another run (longer, slower), or even a recompiled binary, for memory access tracing.
You may start with a short and simple program (not the ref inputs of recent SPEC CPU, or the billions of memory accesses in your big programs) and use the perf Linux tool (perf_events) to find an acceptable ratio of memory requests recorded to all memory requests. There is the perf mem tool, or you may try some PEBS-enabled events of the memory subsystem. PEBS is enabled by adding the :p or :pp suffix to the perf event specifier, perf record -e event:pp, where event is one of the PEBS events. Also try pmu-tools' ocperf.py for easier Intel event name encoding and to find PEBS-enabled events.
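As a hedged illustration (syntax as in the perf documentation; your_app is a placeholder and the raw event name is CPU-specific), PEBS-based sampling might look like:

perf mem record ./your_app                                   # sample loads/stores via PEBS
perf mem report
perf record -e cpu/mem-loads,ldlat=30/pp -c 100 ./your_app   # explicit PEBS load event; -c controls the recording ratio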
Try to find the real (maximum) overhead with different recording ratios (1% / 10% / 50%) on memory performance tests. Check the worst case of memory recording overhead at the left part of the Arithmetic Intensity scale of the Roofline model (https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/). Typical tests from this part are STREAM (BLAS1), RandomAccess (GUPS) and memlat, which are almost like SpMV; many real tasks are usually not so far to the left on the scale:
STREAM test (linear access to memory),
RandomAccess (GUPS) test
some memory latency test (memlat of 7z, lat_mem_rd of lmbench).
Do you want to trace every load/store instruction, or do you only want to record requests that missed all (or some) caches and were sent to the main RAM of the PC (or to L3)?
Why do you want no overhead and all memory accesses recorded? It is simply impossible, as every memory access requires several bytes of trace (the memory address, sometimes also the instruction address) to be recorded to that same memory. So, having memory tracing enabled (more than 10% of memory accesses traced) will clearly limit the available memory bandwidth, and the program will run slower. Even 1% tracing can be noticed, but its effect (overhead) is smaller.
Your CPU, the E5-2620 v4, is Broadwell-EP 14nm, so it may also have some earlier variant of Intel PT: https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt https://github.com/01org/processor-trace and especially Andi Kleen's blog on PT: http://halobates.de/blog/p/410 "Cheat sheet for Intel Processor Trace with Linux perf and gdb"
PT support in hardware: Broadwell (5th generation Core, Xeon v4) - more overhead, no fine-grained timing.
PS: Researchers who study SPEC CPU memory access behaviour work with memory access dumps/traces, and those dumps were generated slowly:
http://www.bu.edu/barc2015/abstracts/Karsli_BARC_2015.pdf - LLC misses recorded for offline analysis; no timing was recorded from the tracing runs
http://users.ece.utexas.edu/~ljohn/teaching/382m-15/reading/gove.pdf - all loads/stores instrumented by writing into an additional huge tracing buffer, with periodic (rare) online aggregation. Such instrumentation is 2x slower or worse, especially for memory bandwidth / latency limited cores.
http://www.jaleels.org/ajaleel/publications/SPECanalysis.pdf (by Aamer Jaleel of Intel Corporation, VSSAD) - Pin-based instrumentation - the program code was modified and instrumented to write memory access metadata into a buffer. Such instrumentation is 2x slower or worse, especially for memory bandwidth / latency limited cores. The paper lists and explains the instrumentation overhead and caveats:
Instrumentation Overhead: Instrumentation involves injecting extra code dynamically or statically into the target application. The additional code causes an application to spend extra time in executing the original application ... Additionally, for multi-threaded applications, instrumentation can modify the ordering of instructions executed between different threads of the application. As a result, IDS with multi-threaded applications comes at the lack of some fidelity.
Lack of Speculation: Instrumentation only observes instructions executed on the correct path of execution. As a result, IDS may not be able to support wrong-path ...
User-level Traffic Only: Current binary instrumentation tools only support user-level instrumentation. Thus, applications that are kernel intensive are unsuitable for user-level IDS.

Simulate an older machine respecting its overall capabilities?

Note: I reckon my question's main goal isn't programming-related; its means, though, are best known to programmers (for a reason ;)). That said, feel free to suggest some other place should you believe it'll be viewed by better-informed people in this complex field.
We're getting close to deploying refurbished PCs with carefully optimized systems at the local #diy facility I'm volunteering at. It's « the devil is in the details » time.
Question 1: What would your spec sheet be for simulating older machines while keeping reasonably close to their overall capabilities?
Goal
Help test OS and application capabilities in a few real-life end-user scenarios: kids/guests/digitally-illiterate users, as well as retro-gaming and experiment-and-discover scenarios. And then configure/optimize the chosen setups.
This means we ought to guarantee - and therefore test - that the box with its optimized system is capable of:
browsing today's Web, HTML5 included
playing videos
viewing photos
editing office documents
and that's it!
Target
Circa 2000-2005 boxes: mostly SSE (3DNow!) CPUs such as Athlons (K7, 32-bit) and Pentiums (P6), 7200 rpm IDE HDDs, FSB 400 / DDR-400 (or below) SDRAM, and "some kind of" AGP graphics.
Means
Spec sheet; at the moment it looks at the target's:
cpu instructions *
cpu frequency *
FSB/bus speed and bandwidth
IO speed and bandwidth *
graphics *
* is possibly addressed in my current set-up (i.e. "well enough").
Current old box emulation set-up
Core i3 @ 3.3 GHz with DDR3-1600 running Arch Linux [1]
virtualization set-up
cpulimit -l 60 \
qemu-system-i386 -cpu pentium3,enforce -enable-kvm \
-m 1G -vga std -display gtk -hda hdd.img
a) First, it tries to stay at or below the target CPU frequency;
b) and to match its instruction set, checked via
/proc/cpuinfo (and Google Chrome ;) ):
test@guest:~$ inxi -f
CPU: Single core Pentium III (Katmai) (-UP-) cache: 2048 KB clocked at 3292.518 MHz
CPU Flags: apic cmov cx8 de fpu fxsr hypervisor mca mce mmx msr mtrr pae pge pse pse36
sep sse tsc vme x2apic
c) it matches the available RAM;
d) hdd.img sits on a spinning 7200 rpm SATA HDD and uses the qcow2 format, to try to stay closer to the target I/O specs;
e) as for the older 2D rendering capabilities, AFAIK QEMU/KVM's -vga std ("great" GPU emulation ;) ) makes it a good choice to simulate that.
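One hedged refinement of the set-up above on the I/O side (the throttling.* drive options exist in recent QEMU versions; verify against yours, and the ~40 MB/s cap is only a guess at early-2000s IDE throughput) would be:

cpulimit -l 60 \
qemu-system-i386 -cpu pentium3,enforce -enable-kvm \
-m 1G -vga std -display gtk \
-drive file=hdd.img,format=qcow2,if=ide,throttling.bps-total=40000000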
Question two: Are the following limits really impactful?
It's unclear whether QEMU/KVM succeeds in restricting the vCPU to the chosen CPU model; e.g. shouldn't -cpu pentium3 show a 250 KB cache?
The I/O subsystem (a big point here, especially for finding the correct Linux kernel virtual memory settings). It would be really cool to stay at or below the efficiency of an IDE bus and a 20 GB HDD.
What about the FSB/memory subsystem part?
Alternatively, do you know of tools, tips or tricks to achieve a VM set-up that is a better "reasonably equal-or-less" match to the target capabilities?
[1] Limited opening hours at the DIY facility make it necessary to fine-tune at my place. The host has VT-x enabled but no VT-d capability.
EDIT: "How can I simulate a slow machine in a VM?" and
"Slow down CPU to simulate slower computers in browser testing" cover the CPU speed part. "How To Simulate Lower CPU Processor Machines For Browser Testing" and
"Emulate old PC? [closed]" do too, but are centered on a Windows host.

How is memory segmentation bounds-checking done?

According to the wikipedia article on memory segmentation, x86 processors do segmentation bounds-checking in hardware. Are there any systems that do the bounds-checking in software? If so, what kind of overhead does that incur? In the hardware implementations, is there any way to skip the bounds checking to avoid the penalty (if there is a penalty)?
All modern languages do bounds checking in software, on top of segment bounds checking and memory map lookups. One benchmark suggests the overhead is about 0.5%. This is a small price to pay for stability and security.
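As a generic illustration of what such a software check looks like (ordinary C++, not tied to any particular benchmark mentioned above):

#include <cstddef>
#include <stdexcept>
#include <vector>

// at() performs the software bounds check: it compares the index against the
// vector's size and throws std::out_of_range on failure; operator[] would not.
int element(const std::vector<int>& v, std::size_t i) {
    return v.at(i);
}

int main() {
    std::vector<int> v{1, 2, 3};
    try {
        return element(v, 10);               // out of range: the check fires
    } catch (const std::out_of_range&) {
        return 0;
    }
}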
A 486 could load a memory location in a single cycle, and CPUs have only gotten better at doing on-chip processing, so it's unlikely that segmentation bounds checking has any overhead at all.
Still, though, you can simply run in 64bit mode: "The processor does not perform segment limit checks at runtime in 64-bit mode" (Intel's developer manual, 3.2.4).

What are the advantages of a 64-bit processor?

Obviously, a 64-bit processor has a 64-bit address space, so you have more than 4 GB of RAM at your disposal. Does compiling the same program as 64-bit and running on a 64-bit CPU have any other advantages that might actually benefit programs that aren't enormous memory hogs?
I'm asking about CPUs in general, and Intel-compatible CPUs in particular.
There's a great article on Wikipedia about the differences and benefits of 64-bit Intel/AMD CPUs over their 32-bit versions. It should have all the information you need.
Some of the key differences are:
16 general purpose registers instead of 8
Additional SSE registers
A no execute (NX) bit to prevent buffer overrun attacks
The main advantage of a 64-bit CPU is the ability to have 64-bit pointer types that allow virtual address ranges greater than 4GB in size. On a 32-bit CPU, the pointer size is (typically) 32 bits wide, allowing a pointer to refer to one of 2^32 (4,294,967,296) discrete addresses. This allows a program to make a data structure in memory up to 4GB in size and resolve any data item in it by simply de-referencing a pointer. Reality is slightly more complex than this, but for the purposes of this discussion it's a good enough view.
A 64-bit CPU has 64-bit pointer types that can refer to any address in a space with 2^64 (18,446,744,073,709,551,616) discrete addresses, or 16 exabytes. A process on a CPU like this can (theoretically) construct and logically address any part of a data structure up to 16 exabytes in size by simply de-referencing a pointer (looking up data at an address held in the pointer).
This allows a process on a 64-bit CPU to work with a larger set of data (constrained by physical memory) than a process on a 32 bit CPU could. From the point of view of most users of 64-bit systems, the principal advantage is the ability for applications to work with larger data sets in memory.
Aside from that, you may get a native 64-bit integer type. Native 64-bit integers make arithmetic and logical operations using 64-bit types such as C's long long faster than when they are implemented as two 32-bit operations. Floating-point arithmetic is unlikely to be significantly affected, as the FPUs on most modern 32-bit CPUs natively support 64-bit double floating-point types.
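As a small, hedged illustration (it only prints what a particular build provides; nothing here is specific to the answer above):

#include <cstdint>
#include <cstdio>

int main() {
    // Typically 64 and 64 on a 64-bit build; 32 and 64 on a 32-bit x86 build.
    std::printf("pointer: %zu bits, long long: %zu bits\n",
                8 * sizeof(void*), 8 * sizeof(long long));

    // A 64-bit add is a single instruction on a 64-bit CPU; a 32-bit build has to
    // synthesize it from two 32-bit operations (add/adc on x86).
    std::uint64_t a = 0x0123456789ABCDEFULL, b = 1;
    std::printf("sum: %llu\n", static_cast<unsigned long long>(a + b));
    return 0;
}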
Any other performance advantages or enhanced feature sets are a function of specific chip implementations, rather than something inherent to a system having a 64 bit ALU.
This article may be helpful:
http://www.softwaretipsandtricks.com/windowsxp/articles/581/1/The-difference-between-64-and-32-bit-processors
This one is a bit off-topic, but might help if you plan to use Ubuntu:
http://ubuntuforums.org/showthread.php?t=368607
And the PDF below contains a detailed technical specification:
http://www.plmworld.org/access/tech_showcase/pdf/Advantage%20of%2064bit%20WS%20for%20NX.pdf
Slight correction. On 32-bit Windows, the limit is about 3 GB of RAM. I believe the remaining 1 GB of address space is reserved for hardware. You can still install 4 GB, but only 3 will be accessible.
Personally I think anyone who hasn't happily lived with 16K on an 8-bit OS in a former life should be careful about casting aspersions against some of today's software starting to become porcine. The truth is that as our resources become more plentiful, so do our expectations. The day is not long off when 3GB will start to seem ridiculously small. Until that day, stick with your 32-bit OS and be happy.
About a 1-3% speed increase, due to instruction-level parallelism for 32-bit calculations.
Just wanted to add a little bit of information on the pros and cons of 64-bit CPUs. https://blogs.msdn.microsoft.com/joshwil/2006/07/18/should-i-choose-to-take-advantage-of-64-bit/
The main difference between 32-bit processors and 64-bit processors is the speed at which they operate. 64-bit processors come in dual-core, quad-core, and six-core versions for home computing (with eight-core versions coming soon). Multiple cores allow for increased processing power and faster computer operation. Software programs that require many calculations to function operate faster on the multi-core 64-bit processors, for the most part. It is important to note that 64-bit computers can still use 32-bit software programs, even when the Windows operating system is a 64-bit version.
Another big difference between 32-bit processors and 64-bit processors is the maximum amount of memory (RAM) that is supported. 32-bit computers support a maximum of 3-4 GB of memory, whereas a 64-bit computer can support memory amounts over 4 GB. This is important for software programs used for graphic design, engineering design, or video editing, where many calculations are performed to render images, drawings, and video footage.
One thing to note is that 3D graphics programs and games do not benefit much, if at all, from switching to a 64-bit computer, unless the program is a 64-bit program. A 32-bit processor is adequate for any program written for a 32-bit processor. In the case of computer games, you'll get a lot more performance by upgrading the video card instead of getting a 64-bit processor.
In the end, 64-bit processors are becoming more and more commonplace in home computers. Most manufacturers build computers with 64-bit processors due to cheaper prices and because more users are now using 64-bit operating systems and programs. Computer parts retailers are offering fewer and fewer 32-bit processors and soon may not offer any at all.
