jemalloc and concurrent hash map - malloc

I have a huge concurrent hash map (ACE); when I say huge, I mean I am doing lots and lots of put and get operations. This is on the Windows platform. Recently I was reading about jemalloc.
For the above scenario, if I use jemalloc, will it help performance and avoid memory fragmentation?
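For context, here is roughly what I mean by plugging jemalloc in explicitly on Windows. This is only a sketch and assumes a jemalloc build configured with the je_ symbol prefix; the Node struct is just a stand-in for the map's entries, not part of ACE:

#include <jemalloc/jemalloc.h>
#include <cstdio>

// Stand-in for whatever the hash map actually stores.
struct Node {
    int key;
    int value;
};

int main() {
    // Allocate and free through jemalloc directly instead of the CRT heap.
    Node* n = static_cast<Node*>(je_malloc(sizeof(Node)));
    n->key = 1;
    n->value = 42;
    std::printf("key=%d value=%d\n", n->key, n->value);
    je_free(n);

    // Dump jemalloc's internal statistics to judge fragmentation over time.
    je_malloc_stats_print(nullptr, nullptr, nullptr);
    return 0;
}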

Related

Data security during dynamic memory allocation

A few minutes ago, my friends and I were solving some algorithmic problems on leetcode.com and sharing our solutions. We used high-level languages, and when new memory is allocated by Array.new(128) in Ruby or int[] map = new int[128]; in Java, it is already filled with zero-like values (nil or 0, respectively).
So it's guaranteed that a high-level program gets a cleared area.
And here is my question: in a C or assembler program, could it happen that a new chunk of memory still holds data from another process, unchanged?
One process would then get another process's data, perhaps even data from another user who worked on the system some time ago. Could this be a way information leaks?
Does the OS clear memory before sharing it among processes? And if so, isn't it very expensive to run that many iterations?
Thank you.
UPD: according to http://www.cplusplus.com/articles/ETqpX9L8/, it looks like valuable data needs to be cleared manually in "lower-level" languages to prevent it leaking to other processes.
Yes, in lower-level languages where memory is not initialized, it could contain valuable stuff from other processes. There have been encryption key leakage attacks done this way by continually allocating memory and scanning it for what looks like useful information.
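As a rough single-process illustration of what "not initialized" means in C/C++ (whether the old bytes actually come back depends entirely on the allocator, so treat this as a sketch, not a guarantee):

#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    char* secret = static_cast<char*>(std::malloc(64));
    std::strcpy(secret, "s3cr3t-p4ssw0rd");
    std::free(secret);                  // freed without clearing

    // The next allocation may hand back the same block, old contents and all.
    char* reused = static_cast<char*>(std::malloc(64));
    std::printf("uninitialized contents: %.64s\n", reused);

    std::free(reused);
    return 0;
}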
Security-sensitive programs that store passwords, crypto keys, etc. should always clear the memory as soon as possible after use. It's not only to prevent leaks through re-allocated memory; there are also other attack vectors, like RAM dumps, that could be used to extract secrets. Always zero or randomize your memory when you are done with it.
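For example, a minimal wipe helper could look like the sketch below; the volatile loop is one common way to stop the compiler from eliding the "dead" stores, and SecureZeroMemory on Windows or explicit_bzero on glibc/BSD are platform-provided equivalents:

#include <cstddef>
#include <cstring>

// The volatile pointer prevents the compiler from treating the wipe as a
// dead store and optimizing it away.
void secure_zero(void* p, std::size_t n) {
    volatile unsigned char* vp = static_cast<volatile unsigned char*>(p);
    while (n--) {
        *vp++ = 0;
    }
}

int main() {
    char password[64];
    std::strcpy(password, "hunter2");        // pretend this came from the user
    // ... use the password ...
    secure_zero(password, sizeof(password)); // wipe before the memory is reused
    return 0;
}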

Difference between "instruction fetch" and "data read"?

I have a question regarding a paper I am reading right now, which demonstrates an attack against some tamper-resistant software that uses a self-hashing mechanism. This kind of self-hashing works because the authors assume that the executed code is the same as the hashed code, which is true except under certain manipulations of the way the processor accesses memory.
In the paper, the following sentence troubles me: "A critical (implicit) assumption of both the hashing in Aucsmith’s IVK and checksum systems employing networks is that processors operate such that D(x) = I(x), where D(x) is the bit-string result of a “data read” from memory address x, and I(x) is the bit-string result of an “instruction fetch” of corresponding length from x."
How do you state the difference between D(x) and I(x)? What is the difference between a data read and an instruction fetch?
Thanks for your help.
The difference in these operations is when they occur and where the data is stored before use. Most processors have dedicated caches for instructions. This may mean that the data is fetched from main memory twice: once into a data cache for calculating the hash and again into the instruction cache.
I cannot find it now, but a year ago I read of a means of hiding malicious code on an Intel processor by causing a cache incoherency between these two caches. The processor would execute the malicious code, but any other tool reading the same memory as mere data would see good code. Here is a means of accomplishing this on an ARM chip.
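To make the D(x)/I(x) distinction concrete, here is a rough sketch of computing a "data read" checksum over a function's own bytes, while executing the same bytes is the "instruction fetch"; the cast from a function pointer to a data pointer is implementation-defined (it works on common x86/ARM toolchains), and the 64-byte length is an arbitrary placeholder:

#include <cstddef>
#include <cstdint>
#include <cstdio>

// A tiny function whose machine code we read back as data.
int answer() { return 42; }

// FNV-1a over a byte range: this is the "data read" D(x) in the paper's notation.
std::uint32_t hash_bytes(const void* start, std::size_t len) {
    const unsigned char* p = static_cast<const unsigned char*>(start);
    std::uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

int main() {
    // Implementation-defined cast from a function pointer to a data pointer.
    const void* code = reinterpret_cast<const void*>(&answer);
    std::printf("D(x): hash over answer()'s bytes = %08x\n", hash_bytes(code, 64));
    // Calling answer() makes the CPU perform the "instruction fetch" I(x) over
    // the same addresses; a self-hashing scheme assumes both see identical bits.
    std::printf("I(x): result of executing answer() = %d\n", answer());
    return 0;
}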

How to optimize an algorithm for a given multi-core architecture

I would like to know what techniques I should look into for optimizing a given algorithm for a given architecture. How do I improve performance through better caching? How do I reduce cache-coherency traffic, and what access patterns should I avoid in my algorithm/program so that cache coherency doesn't hurt my performance?
I understand a few standard techniques for reusing recently cached data in L1, but how would I effectively use data in a shared cache (say L2) on a multi-core, thereby avoiding a main-memory access, which is even more costly?
Basically, I am interested in what data access patterns I should try to exploit or avoid for a better mapping to my given architecture, and in what data structures I could use, in which scenarios, for which architectures (with different levels of private and shared cache), to improve performance. Thanks.
What techniques should I look into for optimizing a given algorithm for a given architecture?
Micro-architectures vary, so learn the details of your specific processor. Intel provides good documentation in their optimization guide. If you are using an Intel processor, you'll want to read sections 8.3 and 8.6:
8.3 OPTIMIZATION GUIDELINES
This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are listed (in order of importance):
Thread synchronization
Bus utilization
Memory optimization
Front end optimization
Execution resource optimization
Practices associated with each area are listed in this section. Guidelines for each area are discussed in greater depth in sections that follow. Most of the coding recommendations improve performance scaling with processor cores; and scaling due to HT Technology. Techniques that apply to only one environment are noted.
8.6 MEMORY OPTIMIZATION
Efficient operation of caches is a critical aspect of memory optimization. Efficient operation of caches needs to address the following:
Cache blocking
Shared memory optimization
Eliminating 64-KByte aliased data accesses
Preventing excessive evictions in first-level cache
What data access patterns should I try to exploit or avoid for a better mapping to my given architecture?
Exploit
When the cache is full and an access misses, the cache must evict something to make room for the new data/code; what gets evicted is usually chosen by an approximation of least-recently-used (LRU). If possible, your code should have strong locality of reference:
Try to pack data that is used close in time in the algorithm such that it is close in space (address)
Pack data tightly, don't use a 64-bit integer when a 32-bit integer will do, for example
Sometimes the alignment of an "object" (related data) relative to a cache line matters. For example, if there is an array of objects, each 64 bytes, and they are accessed randomly, then aligning them on a 64-byte boundary improves cache efficiency by not bringing in data that is not used. If an object isn't aligned, then every object touched brings in two cache lines even though only 64 bytes are needed, so 50% of the data transferred isn't used (this assumes 64-byte cache lines). See the sketch after this list.
As Paul A. Clayton pointed out in the comments, prefetching data is very important, as it hides part or all of the memory latency: "Also, exploiting stride-based hardware prefetching can be quite beneficial. (Software prefetching can also be useful in some cases.) Getting pointers early helps increase memory-level parallelism."
To help the hardware prefetcher and to increase the utilization of the data that is brought into the cache, pay careful attention to how matrices and other large structures are stored and accessed... see the Wikipedia article on row-major order.
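A short sketch of the two points above, cache-line alignment and row-major traversal (the Record layout and the 64-byte line size are assumptions; the exact numbers depend on your ABI and hardware):

#include <cstddef>
#include <vector>

// One record per 64-byte cache line: 4 + 60 = 64 bytes, and alignas keeps the
// array elements from straddling two lines on a random access.
struct alignas(64) Record {
    int   key;
    float values[15];
};

// Row-major traversal: the inner loop walks consecutive addresses, so each
// cache line is fully used and the hardware prefetcher sees a unit stride.
double sum_row_major(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            s += m[i * cols + j];       // contiguous accesses
    return s;
}

// The same sum with the loops swapped: each access jumps `cols` elements,
// so most of every cache line brought in is wasted.
double sum_strided(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            s += m[i * cols + j];       // stride of `cols` elements
    return s;
}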
Avoid
Data that you don't use often shouldn't be close to data that you use frequently
Avoid false sharing. If two or more threads access the same cache line but are not sharing the same data within the cache line, and at least one of them is a writer, you have false sharing: there will be an unnecessary burden and latency hit associated with the cache-coherency protocol (see the sketch after this list).
Try not to use new data until you are done with the older data
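A minimal sketch of avoiding false sharing by giving each thread's counter its own cache line (the 64-byte line size is an assumption; C++17's std::hardware_destructive_interference_size is the portable way to get it):

#include <atomic>
#include <thread>
#include <vector>

// Each counter gets its own 64-byte cache line, so the writer threads do not
// keep invalidating one another's lines.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

int main() {
    constexpr int nthreads = 4;
    PaddedCounter counters[nthreads];

    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&counters, t] {
            for (long i = 0; i < 10000000; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();

    // With unpadded adjacent counters the same loop ping-pongs a single cache
    // line between cores, which is exactly the false-sharing penalty.
    return 0;
}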
Measure
As Andrei Alexandrescu said in this talk, when it comes to performance tuning the only intuition that is right is "I should measure this." Familiarize yourself with cache performance monitoring tools, for example:
perf
Cachegrind
The key principle is locality: when you have the choice, process nearby data first (avoid sparse accesses), and reuse data as soon as possible (group successive passes over the same data).
For multithreaded programs, the principle is separate locality: ensure that the threads work on disjoint data sets (use distinct copies if necessary/possible).
Unless you have very good reasons to do so, stay away from the peculiarities of the hardware.
It should be mentioned that code is cached in the same way as data. Small, dense code with a lot of inlining and few jumps/calls puts less strain on the L1 instruction cache and, ultimately, on L2, L3 and RAM, where collisions with data fetches occur.
If you are using hyper-threading, there appears to be evidence that a lower optimization level (-O1) on two hyperthreads in a core will overall get more work done than a single, highly optimized (-O2 and higher) thread.

How to choose shared memory interfaces in Linux?

Linux has two different ways to manage shared memory: shm_open()/mmap() and shmget()/shmat(). What are the pros and cons of each? How do I decide which one to choose for my application?
I am asking myself the same question, and I don't know the final answer, but I thought I would share my benchmarks.
I have found that, out of the box, the POSIX shm_open framework is faster than System V shmget.
In my benchmark, I write 32 GB of data from one process through shared memory and read back and verify the same 32 GB. I use ZeroMQ to pass ownership tokens from the writer to the reader to keep things in sync. The memory block size is actually pretty small, 32 KB, but I've found this doesn't seem to be a rate-limiting factor, nor does the presence of ZeroMQ make a big difference.
I've found that the net throughput using POSIX shared memory is about 40% higher than with System V shared memory. Specifically, the net operation (write + read) runs at a sustained rate of 3.5 GB/s with POSIX shared memory and 2.5 GB/s with System V shared memory.
As to why this is true, I don't know. I've also found these benchmarks a bit slippery if you don't pin down process and memory affinities, although the speed difference appears to be present regardless of any combination of CPU binding and memory binding (using numactl).
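For reference, here is a stripped-down sketch of the two interfaces being compared, with error handling omitted; the "/demo_shm" name and the sizes are placeholders, not taken from the benchmark:

#include <fcntl.h>      // O_CREAT, O_RDWR
#include <sys/mman.h>   // shm_open, shm_unlink, mmap, munmap
#include <sys/ipc.h>    // IPC_PRIVATE, IPC_CREAT
#include <sys/shm.h>    // shmget, shmat, shmdt, shmctl
#include <unistd.h>     // ftruncate, close
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
    const std::size_t size = 32 * 1024;   // 32 KB block, as in the benchmark above

    // POSIX: shm_open() + mmap(). May need -lrt on older glibc.
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, size);
    void* posix_mem = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    std::strcpy(static_cast<char*>(posix_mem), "hello via POSIX shm");

    // System V: shmget() + shmat(). A real application would derive the key with ftok().
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    void* sysv_mem = shmat(id, nullptr, 0);
    std::strcpy(static_cast<char*>(sysv_mem), "hello via System V shm");

    std::printf("%s / %s\n", static_cast<char*>(posix_mem), static_cast<char*>(sysv_mem));

    // Both interfaces need explicit removal of the kernel object.
    munmap(posix_mem, size);
    close(fd);
    shm_unlink("/demo_shm");
    shmdt(sysv_mem);
    shmctl(id, IPC_RMID, nullptr);
    return 0;
}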

Mongo suffering from a huge number of faults

I'm seeing a huge number of faults/sec (~200+) in my mongostat output, though a very low lock %:
My Mongo servers are running on m1.large instances on the Amazon cloud, so they each have 7.5 GB of RAM:
root:~# free -tm
             total       used       free     shared    buffers     cached
Mem:          7700       7654         45          0          0       6848
Clearly, I do not have enough memory for all the caching Mongo wants to do (which, by the way, results in a huge CPU usage %, due to disk IO).
I found this document, which suggests that in my scenario (high faults, low lock %), I need to "scale out reads" and get "more disk IOPS."
I'm looking for advice on how to best achieve this. Namely, there are LOTS of different potential queries executed by my node.js application, and I'm not sure where the bottleneck is happening. Of course, I've tried
db.setProfilingLevel(1);
However, this doesn't help me much, because the profiling output only shows me slow queries, and I'm having a hard time translating that information into which queries are causing the page faults...
As you can see, this is resulting in a HUGE (nearly 100%) CPU wait time on my PRIMARY mongo server, though the 2x SECONDARY servers are unaffected...
Here's what the Mongo docs have to say about page faults:
Page faults represent the number of times that MongoDB requires data not located in physical memory, and must read from virtual memory. To check for page faults, see the extra_info.page_faults value in the serverStatus command. This data is only available on Linux systems.
Alone, page faults are minor and complete quickly; however, in aggregate, large numbers of page faults typically indicate that MongoDB is reading too much data from disk and can indicate a number of underlying causes and recommendations. In many situations, MongoDB’s read locks will “yield” after a page fault to allow other processes to read and avoid blocking while waiting for the next page to read into memory. This approach improves concurrency, and in high volume systems this also improves overall throughput.
If possible, increasing the amount of RAM accessible to MongoDB may help reduce the number of page faults. If this is not possible, you may want to consider deploying a shard cluster and/or adding one or more shards to your deployment to distribute load among mongod instances.
So, I tried the recommended command, which is terribly unhelpful:
PRIMARY> db.serverStatus().extra_info
{
    "note" : "fields vary by platform",
    "heap_usage_bytes" : 36265008,
    "page_faults" : 4536924
}
Of course, I could increase the server size (more RAM), but that is expensive and seems like overkill. I should implement sharding, but I'm actually unsure which collections need sharding! Thus, I need a way to isolate where the faults are happening (i.e., which specific commands are causing them).
Thanks for the help.
We don't really know what your data/indexes look like.
Still, an important rule of MongoDB optimization:
Make sure your indexes fit in RAM. http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-MakesureyourindexescanfitinRAM.
Consider that the smaller your documents are, the higher your key/document ratio will be, and the higher your RAM/disk-size ratio will need to be.
If you can adjust your schema a bit to lump some data together, and reduce the number of keys you need, that might help.
