One application, multiple instances, different memory usage

I have node.js server running two instances in cluster mode (via pm2).
The two instances are identical: they execute the same code and load the same data.
Yet memory usage differs by over 100%:
Instance 1: 303,592kB
Instance 2: 614,404kB
Is there any reason the OS (Linux) could cause this behavior? The machine has plenty of RAM, so I would rule out a memory shortage.

Have the two servers been running for the same amount of time? Did they answer the same requests?
Node.js is a garbage-collected runtime. Memory use over time is not constant. The garbage collector kicks in depending on allocation behavior, heap size and limit, idleness, and possibly other factors. Maybe your instance 1 has just done a major round of garbage collection, and instance 2 is about to do one? Have you watched their memory usage over time?
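One way to watch both instances over time, from outside Node.js, is to sample each process's resident set size from /proc. The following is only a minimal sketch, assuming Linux; the helper names (parse_vm_rss_kb, sample_rss_kb, watch) are made up for illustration, and the PIDs would be those of the two instances.

```python
import os
import time


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value (in kB) from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:    303592 kB"
            return int(line.split()[1])
    return None


def sample_rss_kb(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kb(f.read())


def watch(pids, samples=5, interval_s=1.0):
    """Print one dict of {pid: rss_kb} per sampling round."""
    for _ in range(samples):
        print({pid: sample_rss_kb(pid) for pid in pids})
        time.sleep(interval_s)


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    # Hypothetical usage: replace os.getpid() with the PIDs of the two instances.
    watch([os.getpid()], samples=2, interval_s=0.1)
```

If the two curves both rise and fall in a sawtooth pattern but are out of phase, the difference is most likely just garbage collection timing.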

Spark tuning issues

Why has this stage been running with only 1 thread at the end? Because of this it takes a long time to finish; I guess it is not being processed in parallel at that point.
Can anyone explain it?

Since you haven't given any specifics about what exactly you are trying to do, only a broad answer is possible.
The most common cause of one (or just a few) tasks hanging in a larger pool of tasks is skewed data.
Another possibility is that the data for that task simply takes longer to compute (CPU heavy).
Or your task is blocked on IO, which might indicate network/IO channel saturation.
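To make the skew case concrete, here is a small sketch in plain Python (not the Spark API) that mimics hash partitioning and reports how uneven the partitions are. The function names and the sample keys are invented for illustration; integer keys are used because Python hashes small integers to themselves, which keeps the example deterministic.

```python
def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under hash partitioning."""
    sizes = [0] * num_partitions
    for key in keys:
        sizes[hash(key) % num_partitions] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 means perfectly even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


if __name__ == "__main__":
    # Hypothetical data set: 97 records share key 0, three other keys appear once.
    keys = [0] * 97 + [1, 2, 3]
    sizes = partition_sizes(keys, 4)
    print(sizes, skew_ratio(sizes))
```

With data like this, three tasks finish almost immediately and one task processes nearly everything, which looks exactly like a stage "running with 1 thread at the end".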

The question is pretty generic. The Spark documentation says that finding bottlenecks, directly or indirectly, is not easy even for the smallest programs (such as WordCount). The bottleneck can be in IO, in memory-to-CPU transfer, in the CPU while garbage collection is running, in the network, or in factors internal to Spark (such as scheduler delays, buffer memory overflows, etc.).
So you might need to dig deeper, keeping the following in mind:
a. Do you have enough free cores to share the load of the stage?
b. How many executors are configured for this job?
c. Is the 200 GB of data read/written justified for the job you are doing?
d. How much RAM is free on the server before the job is triggered?
e. Check the YARN resource manager for memory and CPU core usage (if you are using YARN).

Does threading a lot lead to thrashing?

Does threading a lot lead to thrashing if each new thread wants to access memory (specifically the same database in my case) and perform read/write operations throughout its lifetime?
I assume that this is true. If my assumption is true, then what is the best way to maximize CPU utilization? And how can I determine that some specific number of threads will give good CPU utilization?
If my assumption is wrong, please do give proper illustrations to let me understand the scenario clearly.

Trashy code causes thrashing, not threads. All code is run by some thread, even main(). Temporary objects are garbage collected the same way on any thread.
The subtle part is when each thread preloads its own objects to perform its work, which can duplicate many instances of the same classes. That is usually a small sacrifice to make for the power of concurrency. But it's not trash (no leak, no deterioration).
There is one exception: when some third-party code caches material in thread locals... You could end up caching the same stuff on each thread. Not really a leak, but not efficient.
Rule of thumb for the number of threads? It depends on the task.
If the tasks are pure computation, like math, then you should not exceed the number of physical (non-hyperthreaded) cores.
If the job is memory intensive along with pure computation (most cases), then the number of logical (hyperthreaded) cores is your target, because the CPU can use the idle time of one thread's memory accesses to run another logical core's computations.
If the job is mostly large sequential disk I/O, then your number of threads should not be much above the number of disk spindles available to read from. This is VERY approximate, since disk caches, DMA, SSDs, RAID and the like strongly affect how well the disk layer can service your threads without idling. The same applies to random access. However, virtualization these days will throw all your estimates out the window: disk I/O could be much more available than you think, but also much worse.
If the jobs are mostly network I/O waits, then the limit is not really on your side; I would start with about 3x the number of cores. This multiplier simply presumes that each such thread waits on the network for 2/3 of its time, which is very low in practice. A thread could spend 99% of its time waiting for network I/O (a 100x multiplier). That is why you see NIO sockets everywhere: they handle many connections with fewer, busier threads.
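The 3x and 100x multipliers above both follow from the same arithmetic: if a thread spends a fraction w of its time waiting, roughly cores / (1 - w) threads are needed to keep the cores busy. Here is a minimal sketch of these starting points; the function names are invented, the choice of logical cores for the network case is an assumption, and the disk case is left out because it depends on the storage hardware.

```python
def threads_for_wait_fraction(cores, wait_fraction):
    """Threads needed to keep `cores` busy if each thread waits `wait_fraction` of the time."""
    if not 0 <= wait_fraction < 1:
        raise ValueError("wait_fraction must be in [0, 1)")
    return max(1, round(cores / (1 - wait_fraction)))


def suggested_threads(kind, physical_cores, logical_cores, wait_fraction=2 / 3):
    """Starting point only; the real answer comes from measuring."""
    if kind == "compute":   # pure computation: one thread per physical core
        return physical_cores
    if kind == "memory":    # computation plus heavy memory access: one per logical core
        return logical_cores
    if kind == "network":   # mostly waiting on the network (assumes logical cores)
        return threads_for_wait_fraction(logical_cores, wait_fraction)
    raise ValueError(f"unknown workload kind: {kind}")


if __name__ == "__main__":
    print(threads_for_wait_fraction(4, 2 / 3))   # waiting 2/3 of the time
    print(threads_for_wait_fraction(4, 0.99))    # waiting 99% of the time
    print(suggested_threads("network", 4, 8))
```

Treat the result as the first value to try in a benchmark, not as a final setting.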

No. You could have hundreds of idle threads waiting for work and not see any thrashing. Thrashing is caused by the application's working set exceeding available memory, so that active pages have to be reloaded from disk (and written out to disk when temporary variable storage must be saved to be reloaded later).
Threads share an address space, and having many active ones leads to diminishing returns due to lock contention. In the DB case, many processes reading tables can proceed simultaneously, yet updates of dependent data need to be serialised to keep the data consistent, which may cause lock contention and limit parallel processing.
Poorly written queries that need to load and sort large tables in memory may cause thrashing when they exceed free RAM (perhaps because of a poor choice of indexes). You can increase query throughput, and so utilise the CPUs more, by having large RAM disk caches and by using SSDs to reduce random data access times.
For memory-intensive computations, cache sizes may become important: fewer threads whose data stays in cache, with CPU prefetching minimising stalls, work better than many threads competing to load their data from main memory.

Nodejs (socket.io) memory usage compared to JavaScript's primitive data type memory usage

First off, my title may be poorly defined or misleading, but it pretty much tries to summarize my question as a whole.
I've searched a lot without getting my questions fully answered.
1. How does socket.io memory allocation through MemoryStorage (socket.get/socket.set) work, and approximately how much memory does one socket.set use? Is memory freed correctly when a socket disconnects? Optional: Are there any known memory leaks I should be aware of in v0.11.0-pre?
2. How does JavaScript's garbage collector work with objects and associative arrays declared in global scope? Is memory ultimately freed when I "delete" a key-value pair like this: "delete object[key];"? Or will RAM usage continually increase as client requests increase?
3. How do options 1 and 2 compare to each other? Should I use socket.set over globally declared "maps" when it comes to growing and ultimately freeing memory? Optional: How do they compare when it comes to performance ("executing" socket.get/object[key])?
As for basic information on my project, I'm developing a game server in Node.js (single process) which is expected to accept as many clients as a single server can handle (before extending to a cluster of servers). Optional: Is there anything else regarding my project I should be aware of when it comes to load and memory?
Thank you for your time, it is much appreciated!

Pros and Cons of CPU affinity

Suppose I have a multi-threaded application (say ~40 threads) running on a multiprocessor system (say 8 cores) with Linux as the operating system, where threads are essentially LWPs (lightweight processes) scheduled by the kernel.
What would be the benefits and drawbacks of using CPU affinity? Would CPU affinity help by localizing the threads to a subset of cores thus minimizing cache sharing/misses?

If you use strict affinity, then a particular thread MUST run on that processor (or set of processors). If you have many threads that work completely independently, and they work on larger chunks of memory than a few kilobytes, then it's unlikely you'll benefit much from running on one particular core - since it's quite possible the other threads running on this particular CPU would have thrown out any L1 cache contents, and quite possibly L2 cache contents too. Which is more important for performance - cache contents or "getting to run sooner"? Are some CPUs always idle, or is the CPU load 100% on every core?
However, only you know (until you tell us) what your threads are doing. How big is the "working set" (how much memory, code and data, do they touch each time they get to run)? How long does each thread run each time it is scheduled? What is the interaction with other threads? Are other threads using shared data with "this" thread? How much, and what is the pattern of sharing?
Finally, the ultimate answer is "What makes it run faster?" - an answer you can only find by having good (realistic) benchmarks and trying the different possible options. Even if you give us every single line of code, running time measurements for each thread, etc, etc, we could only make more or less sophisticated guesses - until these have been tried and tested (with VARYING usage patterns), it's almost impossible to know.
In general, I'd say that having many threads suggests either that each thread isn't very busy (CPU-wise), or that you are "doing it wrong"... More threads aren't better if they are all running flat out - better to have fewer threads in that case, because they are just going to fight each other.

The scheduler already tries to keep threads on the same cores, and to avoid migrations. This suggests that there's probably not a lot of mileage in managing thread affinity manually, unless:
you can demonstrate that for some reason the kernel is doing a bad job for your particular application; or
there's some specific knowledge about your application that you can exploit to good effect.

"localizing the threads to a subset of cores thus minimizing cache sharing/misses"
Not necessarily. You have to consider cache coherence too: if two or more threads access a shared memory buffer and each is bound to a different CPU core, their caches have to be kept synchronized. Whenever one thread writes to a shared cache line, there is significant overhead to invalidate the copies in the other caches.
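For experimenting with affinity on Linux without writing C, Python exposes the underlying calls as os.sched_getaffinity and os.sched_setaffinity. The following is a minimal sketch only; run_pinned is a made-up helper, and on platforms without these calls it simply runs the function unpinned.

```python
import os


def run_pinned(cpus, func):
    """Run func() with the caller restricted to `cpus`, then restore the old mask.

    Returns (affinity_during_call, func_result). The first element is None on
    platforms that do not provide os.sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None, func()
    original = os.sched_getaffinity(0)   # pid 0 means "the caller"
    os.sched_setaffinity(0, cpus)
    try:
        return os.sched_getaffinity(0), func()
    finally:
        # Always restore the original mask, even if func() raises.
        os.sched_setaffinity(0, original)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        first_cpu = min(os.sched_getaffinity(0))
    else:
        first_cpu = 0
    seen, total = run_pinned({first_cpu}, lambda: sum(range(1_000_000)))
    print(seen, total)
```

Timing the same workload pinned and unpinned is the quickest way to see whether affinity helps or hurts in a particular case.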

Dual-Core Hyperthreading: Should I use 4 threads or 3 or 2?

If you're spawning multiple threads (or processes) concurrently, is it better to spawn as many as the number of physical processors or the number of logical processors, assuming the task is CPU-bound? Or is it better to do something in between (say, 3 threads)?
Does the performance depend on the kind of instructions that are getting executed (say, would non-local memory access be much different from cache hits)? If so, in which cases is it better to take advantage of hyperthreading?
Update:
The reason I'm asking is, I remember reading somewhere that if you have as many tasks as the number of virtual processors, tasks on the same physical core can sometimes starve some CPU resources and prevent each other from getting as many resources as needed, possibly decreasing performance. That's why I'm wondering if having as many threads as virtual cores is a good idea.

The performance depends on a huge variety of factors. Most tasks are not strictly CPU bound, since even if all of the data is in memory it is usually not on-board in the processor cache. I have seen examples (like this one) where memory access patterns can dramatically change the performance profile of a given 'parallel' process.
In short, there is no perfect number for all situations.

Chances are pretty good that you will see a performance improvement running 2 threads per core with HyperThreading enabled. Jobs that appear to be entirely CPU bound usually aren't, and HyperThreading can extract a few "extra" cycles out of the occasional interrupt or context switch.
On the other hand, with a Core iX processor that has Turbo Boost, you might actually do better running 1 thread per core to encourage the CPU to overclock itself.
At work, we routinely run many-core servers at full CPU doing various kinds of calculation for days at a time. A while back we measured the performance difference with and without HT. We found that on average, with HyperThreading and running twice as many jobs at once, we could complete the same number of jobs about 10% faster than without HyperThreading.
Assume that 2 × cores is a good place to start, but the bottom line is: measure!
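A tiny harness along the following lines can do the measuring. This is a sketch under stated assumptions: `run` is whatever function launches your real workload with a given number of workers, and fake_job below is only a hypothetical stand-in that pretends 2 workers is the sweet spot, so its result says nothing about real hardware.

```python
import time


def time_once(run, workers):
    """Elapsed wall-clock seconds for one call of run(workers)."""
    start = time.perf_counter()
    run(workers)
    return time.perf_counter() - start


def best_worker_count(run, candidates, repeats=3):
    """Return (best_count, timings), where timings maps count -> best elapsed seconds."""
    timings = {}
    for workers in candidates:
        # Keep the fastest of several runs to reduce noise from other processes.
        timings[workers] = min(time_once(run, workers) for _ in range(repeats))
    best = min(timings, key=timings.get)
    return best, timings


def fake_job(workers):
    # Stand-in workload (hypothetical): gets slower the further we are from 2 workers.
    time.sleep(0.01 + 0.02 * abs(workers - 2))


if __name__ == "__main__":
    best, timings = best_worker_count(fake_job, [1, 2, 3, 4])
    print(best, timings)
```

For a dual-core hyperthreaded machine the natural candidates are 2, 3 and 4 workers, run against the real job with realistic input.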

I remember reading that hyperthreading can give you up to a 30% performance boost. In general you'd do better to treat them as 4 different cores. Of course, in some specific circumstances (e.g. having the same long-running task bound to each core) you can divide your processing better by taking into account that some cores are just logical ones.
More info about hyperthreading itself here.

Using Hyperthreading to run two threads on the same core, when both threads have similar memory access patterns but access disjoint data structures, would be very roughly equivalent to running them on two separate cores each with half the cache. If the memory-access patterns are such that half the cache would be sufficient to prevent thrashing, performance may be good. If the memory-access patterns are such that halving the cache induces thrashing, there may be a ten-fold performance hit (implying one would have been much better off without hyperthreading).
On the other hand, there are some situations where hyperthreading may be a huge win. If many threads will all be reading and writing the same shared data using lock-free data structures, and all threads must see a consistent view of the data, trying to run threads on disjoint processors may cause thrashing, since only one processor at a time may have read-write access to any given cache line; running such threads on two cores may take longer than running only one at a time. Such cache arbitration is not required, however, when a piece of data is accessed by multiple threads on a single core. In those cases, hyperthreading can be a huge win.
Unfortunately, I don't know any way to give the scheduler any "hints" to suggest that some threads should share a core when possible, while others should run separately when possible.

HT allows a boost of approximately 10-30% for mostly CPU-bound tasks that use the extra virtual cores. Although these tasks may seem CPU-bound, unless they are custom-made assembly they will usually suffer from IO waits between RAM and the local cache. This allows one thread running on a physical HT-enabled core to work while the other thread is waiting for IO. This does come with a disadvantage, though: the two threads share the same cache and bus, which leaves fewer resources for each and may cause both threads to pause while waiting for IO.
In the latter case, running a single thread per core gives up some of the maximum theoretical simultaneous processing power (10-30%) in exchange for avoiding the slowdown of cache thrashing, which may be very significant in some applications.
Choosing which cores to use is just as important as choosing how many threads to run. If each thread is CPU-bound for roughly the same duration, it is best to set the affinity so that threads using mostly different resources end up on different physical cores, while threads using common resources are grouped on the same physical core (on different virtual cores), so that the common resources can be served from the same cache without extra IO wait.
Since each program has different CPU-usage characteristics, and cache thrashing may or may not be a major slowdown (it usually is), it is impossible to determine the ideal number of threads without profiling first. One last thing to note is that the OS/kernel also requires some CPU and cache space. If real-time latency is required on CPU-bound threads, it is usually best to set aside a single (physical) core for the OS, so as to avoid sharing cache and CPU resources. If threads are often waiting for IO and cache thrashing is not an issue, or if you are running a real-time OS specifically designed for the application, you can skip this last step.
http://en.wikipedia.org/wiki/Thrashing_(computer_science)
http://en.wikipedia.org/wiki/Processor_affinity

All of the other answers already give lots of excellent info. One more point to consider is that the SIMD unit is shared between the logical cores of the same physical core. So, if you are running threads with SSE code, do you run them on all 4 logical cores, or just spawn 2 threads (assuming two physical cores)? For this odd case, it is best to profile with your app.
