Weird behaviour in Linux for a simple program

We wrote a very simple C++ program to isolate a bug. The app takes a number as an argument, creates that many threads, and sends all of those threads into an event loop. If we run the app with more than 3 threads (including the main thread), top shows it taking 100+ MB of virtual memory. However, if we run it with 3 or fewer threads, it uses about 36 MB of virtual memory. We strace'd the app and found that in the first scenario there is an anonymous mmap of about 65 MB that never gets unmapped. The problem is that memory usage rises as the thread count rises, and we have a large number of binaries with large numbers of threads, so there seems to be a lot of wasted space. Why does this happen? SLES 11, 64-bit.

Each thread gets a stack of around 8 MB by default. You can choose a different size when you create a thread with pthread_attr_setstacksize. Also make sure you always either pthread_join() threads that have ended, or create them as detached threads; otherwise you'll leak memory when a thread ends.
Big virtual memory usage is usually not a problem, though. Unless you are actually using all that space, it's just virtual memory, and you'll hardly run out of that on a 64-bit machine.
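A minimal sketch of both points, assuming POSIX threads (the 256 KB stack size is just an illustrative choice, not a recommendation):

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        (void)arg;
        /* The thread's event loop would run here. */
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_t tid;

        pthread_attr_init(&attr);
        /* Request a 256 KB stack instead of the ~8 MB default. */
        pthread_attr_setstacksize(&attr, 256 * 1024);
        /* Detached threads release their resources as soon as they exit,
           so no pthread_join() is needed (and none is possible). */
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

        if (pthread_create(&tid, &attr, worker, NULL) != 0)
            perror("pthread_create");

        pthread_attr_destroy(&attr);
        pthread_exit(NULL); /* exit main without killing the detached thread */
    }

With the smaller stack size, each additional thread should add far less to the virtual-memory figure reported by top.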

Can the two threads of a virtual core (hyperthreading) be running different OS processes?

I found the following quotation from another answer (Performance difference for multi-thread and multi-process):
Next, you can have CPUs with "hyperthreading", which can run (at least) two threads on a core very rapidly -- but, not processes (since the "hyperthreaded" threads cannot be using distinct address spaces) -- yet another case in which threads can win performance-wise.
Is this accurate? The two threads of a virtual core (hyperthreading) cannot be running different OS processes?
On a hyperthreaded machine, if I have a program architecture that uses "worker" processes that a "supervisor" process communicates with using sockets, would I be likely to see a performance increase by moving those worker processes into the supervisor process as threads (leaving the sockets and everything else the same)?
The first part of the question can be answered quickly: you can test on your own system whether it behaves the same way.
On my old Windows system with an Intel CPU, I can use Task Manager to set the affinity of two different programs (processes) to the two hyperthreads of one core, and see that they both actually run there.
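On Linux you can run the same experiment from code. A minimal sketch using sched_setaffinity (it assumes logical CPUs 0 and 1 are hyperthread siblings of the same core, which you should verify with lscpu or /proc/cpuinfo):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        /* Pin this process to logical CPU 0; run a second copy pinned
           to CPU 1. If 0 and 1 are hyperthread siblings, two distinct
           processes are running on the same physical core at once. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set); /* change to 1 for the second copy */
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            perror("sched_setaffinity");

        for (;;) {
            /* Busy loop so the process shows up at 100% in top;
               stop it with Ctrl+C. */
        }
    }

If both pinned processes run simultaneously, the quoted claim does not hold: the two hyperthreads of a core can execute different OS processes.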
On the second question about workers and supervisors, you could get a small performance gain (at least on Linux) by having them in the same process. This is because on a task switch the kernel may also have to set the page-table register (Intel's CR3), and the code looks something like this during a task switch:
    if (newProcess != oldProcess)
        CR3 = new page table
Setting CR3 effectively invalidates the TLB, since the page tables differ, and the CPU must then perform page-table walks to recover the lost translations. On a 64-bit CPU a page walk typically takes 5 memory references, costing anything from 5 cache accesses (each ~3 cycles, roughly 15 cycles total) to 5 main-memory accesses (each ~300 cycles, roughly 1500 cycles total) per missed translation. Threads in the same process share a page table, so this cost is avoided.

What happens to allocated memory of other threads when forking

I have a huge application that needs to fork itself at some point. The application is multithreaded and has about 200 MB of allocated memory. To ensure that the data allocated by the process won't get duplicated, I want to start a new thread and fork inside that thread. From what I have read, only the thread that calls fork will be duplicated, but what will happen to the allocated memory? Will it still be there? The purpose of this is to restart the application with different startup parameters: once forked, it will call main with my new parameters, hopefully giving me a new process of the same program. Before you ask: I cannot guarantee that the binary will still be in the same place as when I started the process, otherwise I could just fork and exec what's in /proc/self/exe.
Threads are execution units inside the big bag of resources that a process is. A process is the whole thing that you can access from any thread in the process: all the threads, all the file descriptors, all the other resources. So memory is absolutely not tied to a thread, and forking from a thread has no useful effect. Everything still needs to be copied over since the point of forking is creating a new process.
That said, Linux has some tricks to make it faster. Copying 2 gigabytes worth of RAM is neither fast nor efficient. So when you fork, Linux actually gives the new process the same memory (at first), but uses the virtual memory system to mark it as copy-on-write: as soon as one process needs to write to that memory, the kernel intercepts the write and allocates distinct memory for that page so the other process isn't affected.
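A minimal sketch of copy-on-write in action (the 200 MB allocation mirrors the question; any size works): after fork(), the child sees the parent's buffer immediately, and writing to it copies only the touched pages.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        size_t size = 200u * 1024 * 1024;   /* ~200 MB, as in the question */
        char *buf = malloc(size);
        memset(buf, 'A', size);             /* touch every page */

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: the buffer is visible with no copying done yet. */
            printf("child sees: %c\n", buf[0]);
            buf[0] = 'B';                   /* first write copies just this page */
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        printf("parent still sees: %c\n", buf[0]); /* still 'A' */
        free(buf);
        return 0;
    }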

What happens if a process keeps creating threads?

What happens if a process keeps creating threads especially when the number of threads exceeds the limit of the OS? What will Windows and Linux do?
If the threads aren't doing any work (i.e. you don't start them), then on Windows you're subject to resource limitations as pointed out in the blog post that Hans linked. A Linux system, too, will have some limit on the number of threads it can create; after all, your computer doesn't have infinite virtual memory, so at some point the call to create a thread is going to fail.
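A minimal sketch of probing that limit with pthreads (Linux; the count at which pthread_create starts failing, typically with EAGAIN, depends on ulimit settings, available virtual memory, and the per-thread stack size):

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void *idle_thread(void *arg) {
        (void)arg;
        pause(); /* block forever; the thread does no work */
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        unsigned long count = 0;
        int err;
        /* Keep creating threads until the system refuses. */
        while ((err = pthread_create(&tid, NULL, idle_thread, NULL)) == 0)
            count++;
        printf("created %lu threads before failing: %s\n",
               count, strerror(err));
        return 0;
    }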
If the threads are actually doing work, what usually happens is that the system starts thrashing. Each thread (including the program's main thread) gets a small timeslice (typically measured in tens of milliseconds) and then gets swapped out for the next available thread. With so many threads, their combined working sets are large enough to occupy all available RAM, so every thread context switch requires that the currently running thread's memory be written out to virtual memory (disk) and the next available thread's be read back in. So the system spends more time doing thread context switches than actually running the threads.
The threads will continue to execute, but very very slowly, and eventually you will run out of virtual memory. However, it's likely that it would take an exceedingly long time to create that many threads. You would probably give up and shut the machine off.
Most often, a machine that's suffering from this type of thrashing acts exactly like a machine that's stuck in an infinite loop on all cores. Even pressing Control+Break (or similar) won't take effect immediately because the thread that's handling that signal has to be in memory and running in order to process it. And after the thread does respond to such a signal, it takes an exceedingly long time for it to terminate all of the threads and clean up virtual memory.

Is spawning threads based on application memory usage an overkill?

I have a system that uses threads to do various jobs.
Each thread uses anywhere from a reasonable to an excessive amount of memory, so there are times when the PC runs out of memory.
Each thread runs for approximately 8 to 40 seconds.
Is using Process.WorkingSet64 before spawning a new thread (to check memory usage) overkill?
Basically, I am trying to prevent out-of-memory situations.
Is Process.WorkingSet64 too heavy to call that often (let's say once every 4 seconds)?
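For comparison, here is a minimal sketch of the same kind of check on Linux, reading the resident set size from /proc/self/statm (this is a stand-in for what Process.WorkingSet64 reports in .NET, not its implementation):

    #include <stdio.h>
    #include <unistd.h>

    /* Return the resident set size of the current process in bytes,
       or 0 on failure. Linux-specific: parses /proc/self/statm. */
    static long resident_bytes(void) {
        long size_pages = 0, resident_pages = 0;
        FILE *f = fopen("/proc/self/statm", "r");
        if (!f)
            return 0;
        if (fscanf(f, "%ld %ld", &size_pages, &resident_pages) != 2)
            resident_pages = 0;
        fclose(f);
        return resident_pages * sysconf(_SC_PAGESIZE);
    }

    int main(void) {
        printf("resident set: %ld bytes\n", resident_bytes());
        return 0;
    }

Reading a counter like this once every few seconds is cheap; the harder problem is that a point-in-time reading can't tell you how much memory the next thread will need.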

Memory addressing in assembly / multitasking

I understand how programs in machine code can load values from memory in to registers, perform jumps, or store values in registers to memory, but I don't understand how this works for multiple processes. A process is allocated memory on the fly, so must it use relative addressing? Is this done automatically (meaning there are assembly instructions that perform relative jumps, etc.), or does the program have to "manually" add the correct offset to every memory position it addresses.
I have another question regarding multitasking that is somewhat related. How does the OS, which isn't running, stop a thread and move on to the next? Is this done with timed interrupts? If so, how can the values in registers be preserved for a thread? Are they saved to memory before control is given to a different thread? Or, rather than timed interrupts, does the thread simply choose a good time to give up control? In the case of timed interrupts, what happens if a thread is given processor time and doesn't need it? Does it have to waste it, can it call the interrupt manually, or does it alert the OS that it doesn't need much time?
Edit: Or are executables edited before being run to compensate for the correct offsets?
That's not how it works. All modern operating systems virtualize the available memory, giving every process the illusion that it has 2 gigabytes of memory (or more) and doesn't have to share it with anybody. The key component in a machine that does this is the MMU, nowadays built into the processor itself. Another core feature of this virtualization is that it isolates processes: one misbehaving process cannot bring another down with it.
Yes, a clock tick interrupt is used to interrupt the currently running code. Processor state is simply saved on the stack. The operating system scheduler then checks whether any other thread is ready to run and has a high enough priority to get first in line. Some extra code ensures that everybody gets a fair share. Then it's just a matter of setting up the MMU to resume execution on the other thread. If no thread is ready to run, the CPU is idled with the HALT instruction, to be woken again by the next clock interrupt.
This is the ten-thousand-foot view; it is well covered in any book about operating system design.
A process is allocated memory on the fly, so must it use relative addressing?
No, it can use relative or absolute addressing depending on what it is trying to address.
At least historically, the various addressing modes were more about local versus remote memory: relative addressing was for memory addresses close to the current address, while absolute addressing was more expensive but could address anything. With modern virtual memory systems, these distinctions may no longer be necessary.
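A quick way to see the "automatic" part in practice, assuming a position-independent executable with ASLR enabled (the default on most modern Linux distributions): the program below prints a different address for main on each run, yet the source contains no manual offset arithmetic.

    #include <stdio.h>

    int main(void) {
        /* With a PIE binary and ASLR, this address changes between
           runs; the compiler emits PC-relative addressing and the
           loader applies relocations, so no manual fixups are needed. */
        printf("main is loaded at %p\n", (void *)main);
        return 0;
    }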
A process is allocated memory on the fly, so must it use relative addressing? Is this done automatically (meaning there are assembly instructions that perform relative jumps, etc.), or does the program have to "manually" add the correct offset to every memory position it addresses.
I'm not sure about this one. This is normally taken care of by the compiler. Again, modern virtual memory systems make this complexity unnecessary.
Are they saved to memory before control is given to a different thread?
Yes. Typically all of the state (registers, etc.) is stored in a process control block (PCB), a new context is loaded, the registers and other context are loaded from the new PCB, and execution begins in the new context. The PCB can be stored on the stack or in kernel memory, or the implementation can use processor-specific operations to optimize this process.
Or, rather than timed interrupts, does the thread simply choose a good time to give up control?
The thread can yield control -- put itself back at the end of the run queue. It can also wait for some IO or sleep. Thread libraries then put the thread in wait queues and switch to another context. When the IO is ready or the sleep expires, the thread is put back into the run queue. The same happens with mutex locks. It waits for the lock in a wait queue. Once the lock is available, the thread is put back into the run queue.
In the case of timed interrupts, what happens if a thread is given processor time and it doesn't need it? Does it have to waste it, can it call the interrupt manually, or does it alert the OS that it doesn't need much time?
Either the thread can run (perform CPU instructions) or it is waiting -- either on IO or a sleep. It can ask to yield, but typically it does so by [again] sleeping or waiting on IO.
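A minimal sketch of the voluntary ways a thread gives up the CPU, using standard POSIX calls: sched_yield() puts the thread straight back on the run queue, while nanosleep() and blocking I/O park it on a wait queue until the event occurs.

    #include <sched.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        /* Yield: give up the rest of this timeslice but stay runnable;
           the scheduler may pick this thread again immediately. */
        sched_yield();

        /* Sleep: the thread moves to a wait queue and uses no CPU
           until the timer expires (50 ms here). */
        struct timespec ts = { 0, 50 * 1000 * 1000 };
        nanosleep(&ts, NULL);

        /* Blocking I/O: read() parks the thread until data arrives. */
        char c;
        ssize_t n = read(STDIN_FILENO, &c, 1);
        printf("read returned %zd\n", n);
        return 0;
    }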
I probably walked into this question quite late, but then, it may be of use to some other programmers. First - the theory.
A modern operating system virtualizes memory. To do so, it maintains, within its system memory area, a series of page pointers. Each page has a fixed size (usually 4 KB), and when a program asks for memory, it is handed addresses that are virtualized through those page pointers. This approximates the behaviour of the "segment" registers in earlier generations of processors.
Now when the scheduler decides to get another process running, it may or may not keep the previous process in memory. If it keeps it in memory, then all the scheduler does is save the entire register snapshot (nowadays including the YMM registers; this used to be a complex issue, as there was no single instruction that saved the entire context: read up on XSAVE), which has a fixed format (documented in the Intel software manual). This is stored in the memory space of the scheduler itself, along with information on the memory pages that were being used.
If, however, the scheduler needs to dump the context of the process that is about to go to sleep to the hard disk (this usually arises when the process that is waking up needs an extraordinary amount of memory), then it writes the memory pages to disk blocks in the pagefile, a reserved area of disk (also the source of the old grandmother wisdom that the pagefile must equal the size of real memory), and preserves the page-pointer addresses as offsets into the pagefile. When the process wakes up, the scheduler reads the offsets from the pagefile, allocates real memory, populates the page pointers, and loads the contents back from the disk blocks.
Now, to answer your specific questions:
1. Do you need to use only relative addressing, or can you use absolute?
Ans. You may use either. Whatever you perceive as absolute is also relative, since the memory page pointer relativizes that address invisibly. There is no truly absolute memory address anywhere (including I/O device memory) except within the kernel of the operating system itself. To test this, you may disassemble any .EXE program and see that the entry point is always something like CALL 0010, which implies that each thread gets a different "0010" at which to start execution.
2. How do threads get time slices, and what happens if one surrenders its unused slice?
Ans. Threads are given time slices, in order of their position on the run queue; 20 ms is the usual standard on modern systems, though this is sometimes changed in special-purpose builds for servers that do not have many hardware interrupts to deal with. A thread usually surrenders its slice by calling sleep(), which is the formal (and very polite) way to give up the remainder of your time slice. Most libraries implementing asynchronous reads or interrupt actions call sleep() internally, but in many instances top-level programs also call sleep(), e.g. to create a time gap. A call to sleep() will certainly change the process context; the CPU is not left to idle on NOPs.
The other method is to wait for an I/O operation to complete, and this is handled differently. A program that requests I/O cedes its time slice, and the scheduler flags the thread as being in a "waiting for I/O" state; the thread will not be given a time slice until its I/O completes or times out. This feature helps programmers, as they do not have to explicitly write a sleep_until_IO() kind of interface.
Trust this sets you going further in your explorations.
