Well, I am trying to find an assembly instruction that moves a whole instruction to a specific address, independent of the size of the instruction. If there is no such instruction, could anybody give me some ideas on how multithreading could be achieved without system calls or other software? In other words, suppose I am making my own operating system: how can I enable multithreading with efficient assembly code?
Re the question of how to achieve multithreading in raw assembly: Modern 32-bit x86 processors have hardware support for task switching. You can use this to implement multitasking (multiple processes with distinct address spaces) and/or multithreading (multiple execution threads in a single process).
This functionality is documented in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3, Chapter 7, Task Management. NOTE: Reading & understanding this ridiculously dense 4700 page document is likely to require 12 lifetimes.
Using this hardware support is the most straightforward way to get something running but probably not the most efficient. x86-based operating systems early on moved away from this to manual task switching. This allows the system to implement switching features not provided by the CPU and also to perform optimized switches when a full switch would be overkill. The approach became so universal that the 64-bit long mode of x86 processors dropped hardware task management support. Modern operating systems use manual task switching.
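To make the hardware mechanism concrete, here is a minimal C sketch of the 32-bit TSS layout described in that SDM chapter (field names are the conventional ones; this is illustrative, not a drop-in kernel structure — architecturally the segment-selector slots hold a 16-bit value in their low half):

    /* Sketch: the 32-bit x86 Task-State Segment (TSS) from Intel SDM
     * Vol. 3, Ch. 7. Hardware task switching saves and restores this
     * state when you jump/call through a TSS descriptor or task gate. */
    #include <stdint.h>
    #include <stdio.h>

    struct tss32 {
        uint32_t prev_task_link;        /* selector of the previous task */
        uint32_t esp0, ss0;             /* ring-0 stack (privilege change) */
        uint32_t esp1, ss1;
        uint32_t esp2, ss2;
        uint32_t cr3;                   /* page-directory base: address space */
        uint32_t eip, eflags;
        uint32_t eax, ecx, edx, ebx;    /* general registers saved on switch */
        uint32_t esp, ebp, esi, edi;
        uint32_t es, cs, ss, ds, fs, gs;/* segment selectors (low 16 bits) */
        uint32_t ldt_selector;
        uint16_t trap;                  /* debug-trap flag */
        uint16_t iomap_base;            /* offset of I/O permission bitmap */
    };

    int main(void) {
        /* The architectural TSS is 104 bytes. */
        printf("sizeof(struct tss32) = %zu\n", sizeof(struct tss32));
        return 0;
    }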
Re the idea of moving instructions as a way to achieve multithreading: I think your concept is to move code around the address space whenever a switch to new code is necessary. There are two alternative techniques that eliminate the need to do this.
First, instead of moving code you can jump to code at a different location. Early multitasking operating systems such as Unix on the PDP-11 used this technique. You can load all active programs into memory at different locations, set up a periodic interrupt to drop control to your system software every so often, and on each interrupt choose the next program to jump to. The system should keep track of where each program stops so it can resume at the same place.
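You can demonstrate this idea in user space with the POSIX ucontext API, which performs exactly this bookkeeping: record where each task stopped, then jump to the next one. A real kernel would do the same from a timer interrupt rather than an explicit yield; this is a sketch of the concept, not an OS implementation:

    /* Cooperative round-robin switching between two tasks in one address
     * space, using POSIX ucontext. A kernel would do this bookkeeping
     * from a timer interrupt instead of an explicit yield(). */
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t sched_ctx, task_ctx[2];
    static int current;

    static void yield(void) {            /* the "interrupt": back to scheduler */
        swapcontext(&task_ctx[current], &sched_ctx);
    }

    static void task(void) {
        for (int i = 0; i < 3; i++) {
            printf("task %d, step %d\n", current, i);
            yield();                     /* scheduler resumes us here later */
        }
    }

    int main(void) {
        static char stack[2][64 * 1024];
        for (int t = 0; t < 2; t++) {
            getcontext(&task_ctx[t]);
            task_ctx[t].uc_stack.ss_sp = stack[t];
            task_ctx[t].uc_stack.ss_size = sizeof stack[t];
            task_ctx[t].uc_link = &sched_ctx;   /* where a finished task goes */
            makecontext(&task_ctx[t], task, 0);
        }
        for (int round = 0; round < 3; round++)     /* crude scheduler loop */
            for (current = 0; current < 2; current++)
                swapcontext(&sched_ctx, &task_ctx[current]);
        return 0;
    }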
The second approach depends on virtual memory. There is still a single physical memory space. On top of that, multiple "virtual" memory spaces are defined. At any given time only 1 virtual memory space is active. Whenever memory is accessed the system translates the virtual address into some physical address. The program accessing memory only sees its own virtual address space.
Each task is given its own address space. When switching to a new task the system activates its private address space then resumes execution wherever it was last paused in the private address space. Instead of moving things around the single shared address space, or even jumping from one place to another in the shared address space, you actually define a whole set of separate address spaces. You shift into new spaces as necessary.
Modern multithreading depends heavily on this approach. Each process will have a private address space; the system shifts between them as necessary. A process will have 1+ threads. Every thread in a process shares the same address space. Then all you store per thread is where execution was last paused. Switching to a new thread in the same process keeps the current address space and just jumps within it to wherever the new thread was last paused.
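A sketch of the bookkeeping this implies is below. All names (save_cpu_state, load_page_table, etc.) are invented for illustration and would be arch-specific assembly in a real kernel:

    /* Sketch of per-process / per-thread kernel state. Names are
     * illustrative, not taken from any particular OS. */
    #include <stdint.h>

    struct cpu_state {            /* where a thread was last paused */
        uintptr_t ip, sp;         /* instruction and stack pointers */
        uintptr_t regs[16];       /* general-purpose registers */
    };

    struct process;

    struct thread {
        struct cpu_state state;
        struct process *owner;    /* every thread belongs to one process */
    };

    struct process {
        uintptr_t page_table_root; /* e.g. the value loaded into CR3 on x86 */
        struct thread *threads;    /* 1+ threads share this address space */
    };

    /* Assumed arch-specific helpers (assembly in a real kernel). */
    void save_cpu_state(struct cpu_state *s);
    void load_page_table(uintptr_t root);
    void restore_cpu_state(const struct cpu_state *s);

    /* Only cross-process switches need to change the address space. */
    void switch_to(struct thread *from, struct thread *to) {
        save_cpu_state(&from->state);
        if (to->owner != from->owner)
            load_page_table(to->owner->page_table_root);
        restore_cpu_state(&to->state);   /* resumes at to->state.ip */
    }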
Re the question of how to move instructions: assembly languages don't provide such an ability, as it's not a common thing to do. If you wanted to operate on machine code in memory, you would likely want to use a library like the Intel X86 Encoder Decoder (XED) library. Such a library knows which instructions exist and the byte format of machine code, which enables it to interpret a sequence of bytes as an instruction.
Once you know how large an instruction is, moving that many bytes is an easy task in assembly. On x86 you would use mov instructions (or rep movsb) to copy the bytes.
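For example, here is a hedged sketch using XED's decode API (as I understand it from its documentation) to find an instruction's length and then copy it as plain bytes:

    /* Sketch: use Intel XED to find an instruction's length, then copy
     * that many bytes. Assumes XED is installed; API per its docs. */
    #include <string.h>
    #include "xed/xed-interface.h"

    /* Copy the single instruction starting at src to dst.
     * Returns its length in bytes, or 0 on a decode error. */
    unsigned copy_one_insn(unsigned char *dst, const unsigned char *src) {
        xed_decoded_inst_t xedd;
        xed_state_t dstate;

        xed_state_zero(&dstate);
        dstate.mmode = XED_MACHINE_MODE_LONG_64;   /* decoding 64-bit code */
        xed_decoded_inst_zero_set_mode(&xedd, &dstate);

        if (xed_decode(&xedd, src, XED_MAX_INSTRUCTION_BYTES) != XED_ERROR_NONE)
            return 0;

        unsigned len = xed_decoded_inst_get_length(&xedd);
        memcpy(dst, src, len);   /* the "move" is an ordinary byte copy */
        return len;
    }

    /* Call xed_tables_init() once at startup before decoding. */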
I'm studying operating system theory. I know that heap allocation involves a specific syscall, and that allocators usually optimize this by requesting more memory than needed up front.
But I can't find information about stack allocation. What about it? Is there a specific syscall every time you read from or write to the stack (for example, when you call a function with some parameters)? Or is there some other mechanism that doesn't involve a syscall?
Typically when the OS starts your program it examines the executable file's headers and arranges various areas for various things (an area for your executable's code, an area for your executable's data, etc.). This includes setting up an initial stack (and a lot more - e.g. finding shared libraries and doing dynamic linking).
After the OS has done all this, your executable starts executing. At this point you already have memory for a stack and can just use it without any system calls.
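A tiny demonstration: all of the stack traffic below happens without any system call (on Linux you can confirm with strace that only printf's write appears):

    /* Once the OS has set up the stack, using it is just pointer
     * arithmetic: no syscall appears for any of the stack use below. */
    #include <stdio.h>

    static int depth(int n) {      /* each call grows the stack a little */
        char local[128];           /* stack allocation: just moves the SP */
        local[0] = (char)n;
        return n == 0 ? local[0] : depth(n - 1);
    }

    int main(void) {
        printf("%d\n", depth(100));
        return 0;
    }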
Note 1: If you create threads, then there will probably be a system call involved to create the thread and that system call will probably allocate memory for the new thread's stack.
Note 2: Typically there's "virtual memory" (what your program sees) and "physical memory" (what the hardware sees). In between, the OS does lots of tricks to improve performance, avoid wasting physical memory, and hide resource limits (so you don't have to worry so much about running out of physical memory). One of these tricks is to allocate virtual memory (e.g. for a large stack) without allocating any actual physical memory, and then allocate the physical memory if/when the virtual memory is first modified (a concrete sketch follows these notes). Other tricks include various "swap space" schemes and memory-mapped files. These tricks rely on requests generated by the CPU on your program's behalf (e.g. page fault exceptions), which aren't system calls but have similar ("ask kernel to do something") characteristics.
Note 3: All of the above depends on which OS. Different operating systems do things differently. I've chosen words carefully - e.g. "Typically" means that most modern operating systems work like I've described (but "typically" does not imply that all possible operating systems work like that; and some operating systems do not work like I've described).
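To make Note 2 concrete, here is a Linux-flavored sketch: mmap returns a huge virtual region immediately, and physical pages are only wired in by page faults as they are touched:

    /* mmap hands back a large virtual region instantly; physical pages
     * are only allocated by page faults as they are first touched. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1UL << 30;    /* 1 GiB of virtual address space */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 1;                  /* first touch: fault, one page wired in */
        p[len - 1] = 1;            /* another fault, another single page */

        /* Resident set stays tiny even though 1 GiB is "allocated";
           compare VmRSS vs VmSize in /proc/self/status to see it. */
        puts("touched 2 pages of a 1 GiB mapping");
        munmap(p, len);
        return 0;
    }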
No, the stack is normal memory. From the process's point of view there is no difference (hence the nasty security bug where you return a pointer to data on the stack, but the stack has since changed).
As Brendan wrote, the OS sets up the stack for the process at program load. But if you access a not-yet-allocated page of the stack (e.g. because your stack is growing), the kernel may automatically allocate a new stack page for you. (This is not so different from what happens when you try to allocate new heap memory and there is none available in your program's space, except that in the heap case you explicitly make a syscall to tell the kernel you want more memory.)
You will notice that the stack usually grows in one direction and the heap (allocated memory) in the other, usually toward each other. So if your program needs more stack, there is space for it; and if your program does not need much stack, that memory can be used for, e.g., a huge array. Or the contrary: if you do a lot of recursion, you use a lot of stack (but probably need less heap memory).
Two additional considerations. First, the CPU may have special stack instructions, but you can see them as syntactic sugar: you can simulate PUSH and POP with MOV, and CALL and RET with JMP plus the simulated PUSH and POP (a sketch of this equivalence follows below).
Second, the kernel may use a special stack for its own purposes (especially important for interrupts).
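Here is that equivalence expressed as a tiny C program: a software stack where push and pop are nothing but pointer arithmetic plus plain stores and loads, exactly what the MOV-based simulation does in assembly:

    /* A software stack in plain C, mirroring how x86 PUSH/POP reduce to
     * pointer arithmetic plus MOV: PUSH x == sub $4,%esp; mov x,(%esp). */
    #include <stdio.h>

    #define STACK_WORDS 64
    static unsigned int stack_mem[STACK_WORDS];
    static unsigned int *sp = stack_mem + STACK_WORDS;  /* grows downward */

    static void push(unsigned int v) { *--sp = v; }  /* sub sp ; mov [sp],v */
    static unsigned int pop(void)    { return *sp++; } /* mov v,[sp] ; add sp */

    int main(void) {
        push(1); push(2);
        printf("%u %u\n", pop(), pop());   /* prints: 2 1 */
        return 0;
    }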
On a standard Linux system, where there is a userland application and the kernel network stack, I've read that moving frames from user space to kernel space (and vice versa) can be expensive in terms of CPU cycles.
My questions are:
Why is it expensive, and does moving the frame in one direction (i.e. from user to kernel) have a higher impact than the other?
Also, how do things differ with TAP-based interfaces? The frame will still be going between user and kernel space, so do the same concerns apply, or is there some form of zero-copy in play?
Addressing questions in-line:
Why is it expensive, and does moving the frame in one direction (i.e. from user to kernel) have a higher impact than the other?
Moving to/from user/kernel spaces is expensive because the OS has to:
Validate the pointers for the copy operation.
Transfer the actual data.
Incur the usual costs involved in transitioning between user/kernel mode.
There are some exceptions to this, such as if your driver implements a strategy such as "page flipping", which effectively remaps a chunk/page of memory so that it is accessible to a userspace application. This is "close enough" to a zero copy operation.
With respect to copy_to_user/copy_from_user, the performance of the two functions is apparently comparable.
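For context, here is a minimal sketch of that copy in a Linux character driver's read() handler (module boilerplate omitted); copy_to_user validates the user pointer and performs the transfer, returning the number of bytes it could not copy:

    /* Sketch of the copy cost in a driver's read(): every byte handed
     * to user space goes through copy_to_user(). */
    #include <linux/fs.h>
    #include <linux/uaccess.h>

    static const char msg[] = "hello from kernel space\n";

    static ssize_t demo_read(struct file *f, char __user *buf,
                             size_t count, loff_t *ppos)
    {
        if (*ppos >= sizeof(msg))
            return 0;                        /* EOF */
        if (count > sizeof(msg) - *ppos)
            count = sizeof(msg) - *ppos;

        /* Returns the number of bytes that could NOT be copied. */
        if (copy_to_user(buf, msg + *ppos, count))
            return -EFAULT;

        *ppos += count;
        return count;
    }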
Also, how do things differ when you move to TAP-based interfaces? As the frame will still be going between user/kernel space, do the same concerns apply, or is there some form of zero-copy in play?
With TUN/TAP based interfaces, the same considerations apply, unless you're utilizing some sort of DMA, page flipping, or similar logic.
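For reference, attaching to a TAP device looks like this on Linux; the frames you read() and write() on the returned descriptor cross the user/kernel boundary like any other file I/O, so the same copy costs apply:

    /* Sketch: attaching to a TAP device on Linux. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/if.h>
    #include <linux/if_tun.h>

    int open_tap(const char *name) {
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0)
            return -1;

        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TAP | IFF_NO_PI;   /* raw Ethernet frames */
        strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {  /* attach to (or create) it */
            close(fd);
            return -1;
        }
        return fd;  /* read() = frame from kernel, write() = frame to kernel */
    }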
Context Switch
Moving frames from user space to kernel space involves a context switch into the kernel, usually caused by a system call (which on 32-bit x86 traditionally invokes the int 0x80 interrupt).
The interrupt happens, entering kernel space;
the OS stores all of the register values (ds, es, fs, eax, cr3, etc.) on the kernel stack of the current thread;
it then jumps to the IRQ handler like a function call;
through some common IRQ execution path, it chooses the next thread to run by some scheduling algorithm;
the runtime info (all the registers) is loaded from the next thread;
control returns to user space.
As we can see, a lot of work is done when moving a frame into or out of the kernel, much more than for a simple function call (which just sets ebp, esp, and eip). That is why this is relatively time-consuming.
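A rough way to feel the difference is to time a trivial syscall against a plain function call (numbers vary wildly by hardware and kernel mitigations; this is a crude benchmark, not a precise measurement):

    /* Crude comparison: kernel round trip vs plain function call.
     * The syscall is typically 1-2 orders of magnitude slower. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    static long __attribute__((noinline)) plain_call(long x) { return x + 1; }

    static double per_iter_ns(struct timespec a, struct timespec b, long n) {
        return ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / n;
    }

    int main(void) {
        enum { N = 1000000 };
        struct timespec t0, t1;
        volatile long sink = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            sink += plain_call(i);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("function call:  %.1f ns\n", per_iter_ns(t0, t1, N));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            sink += syscall(SYS_getpid);   /* bypass glibc's pid caching */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("getpid syscall: %.1f ns\n", per_iter_ns(t0, t1, N));
        return (int)(sink & 1);
    }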
Virtual Devices
As a virtual network device, writing to a TAP is no different from writing to any other /dev/xxx.
If you write to a TAP, the OS is interrupted as described above; it copies your arguments into the kernel and blocks your current thread (with blocking I/O). The kernel driver thread is then notified in some way (e.g. via a message queue) to receive the arguments and consume them.
In Android, there exist some zero-copy system calls, and in my demo implementation this is done through address translation between user and kernel. Because kernel and user threads do not share the same address space, and the user thread's data may change, we usually copy data into the kernel. We can avoid the copy if two conditions are met:
the system call must block, i.e. the data won't change underneath us;
we translate between addresses via the page tables, i.e. the kernel can refer to the right data.
Code
The following is code from my demo OS related to this question, if you are interested in the details:
interrupt handle procedure: do_irq.S, irq_handle.c
system call: syscall.c, ide.c
address translation: MM_util.c
This is a noob question about computer science: how is RAM allocated?
For example, I use Windows. Can I know which addresses are used by a program? How does Windows allocate memory, contiguously or non-contiguously?
Is it the same on Linux?
And can a program access the whole of RAM? (I doubt it, but...)
Do you know any good lectures/documentation on this?
First, when you think you are allocating RAM, you really are not. This is confusing, I know, but it's really not complicated once you understand how it works. Keep reading.
RAM is allocated by the operating system in units called "pages". Usually this means contiguous regions of 4 KiB, but other sizes are possible. To complicate things further, modern processors support "large pages" (usually on the order of 1-4 MiB), and the operating system may have an allocation granularity different from the page size; for example, Windows has a page size of 4 KiB with a granularity of 64 KiB.
Let's ignore those additional details and just think of "pages" that have one particular size (4 KiB).
If you allocate and use areas that are greater than the system's page size, you will usually not have contiguous memory, but you will nevertheless see it as contiguous, since your program can only "think" in virtual addresses. In reality you may be using two (or more) pages that are not contiguous at all, but they appear to be. These virtual addresses are transparently translated to the actual addresses by the MMU.
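You can see the page granularity from user space; this snippet asks the OS for its page size and rounds a request up to whole pages, as any allocator must (assuming, as in practice, a power-of-two page size):

    /* Page granularity in practice: round a request up to whole pages. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);        /* typically 4096 */
        size_t request = 10000;                   /* what the program wants */
        size_t rounded = (request + page - 1) & ~(size_t)(page - 1);
        printf("page size %ld, request %zu -> %zu bytes (%zu pages)\n",
               page, request, rounded, rounded / page);
        return 0;
    }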
Further, not all memory that you believe to have allocated necessarily exists in RAM at all times, and the same virtual address may correspond to entirely different pieces of RAM at different times (for example when a page is swapped out and is later swapped in again -- your program will see it at the same address, but in reality it is most likely in a different piece of RAM).
Virtual memory is a very powerful instrument. While one address in your program can only refer to [at most] one physical address (in a particular page) in RAM, one physical page of RAM can be mapped to several different addresses in your program, and even in several independent programs.
It is for example possible to create "circular" memory regions, and code from shared libraries is often loaded into one memory location, but used by many programs at the same time (and it will have different addresses in those different programs). Or, you can share memory between programs with that technique so when one program writes to some address, the value in the other program's memory location changes (because it is the exact same memory!).
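As a concrete example of the sharing technique, here is a POSIX shared-memory sketch (the name "/demo_shm" is arbitrary; link with -lrt on older glibc). A second process that shm_opens the same name and maps it sees the same physical pages, possibly at a different virtual address:

    /* Writer side of a shared mapping; a reader would shm_open the
     * same name and mmap it to see the exact same memory. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        ftruncate(fd, 4096);                   /* one page is enough */

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        strcpy(p, "visible to every process that maps /demo_shm");
        /* Another process sees this write immediately: same RAM,
           possibly mapped at a different virtual address over there. */
        munmap(p, 4096);
        close(fd);
        return 0;
    }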
On a high level, you ask your standard library for memory (e.g. malloc), and the standard library manages a pool of regions that it has reserved in a more or less unspecified way (there are many different allocator implementations, they all have in common that you can ask them for memory, and they give back an address -- this is where you think that you are allocating RAM when you are not).
When the allocator needs more memory, it asks the operating system to reserve another block. Under Linux, this might be sbrk and mmap, under Windows, this would for example be VirtualAlloc.
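A toy illustration of this split: grab one big block from the OS, then hand out pieces without further system calls. This is a bump allocator, far simpler than any real malloc, but the OS interaction is the same in kind:

    /* Toy allocator: one mmap up front (VirtualAlloc on Windows),
     * then purely user-space bookkeeping -- no syscall per toy_malloc. */
    #include <stddef.h>
    #include <sys/mman.h>

    static char  *pool;            /* block obtained from the OS */
    static size_t pool_off, pool_size;

    void *toy_malloc(size_t n) {
        if (pool == NULL) {                        /* first call: ask the OS */
            pool_size = 1 << 20;                   /* reserve 1 MiB up front */
            void *m = mmap(NULL, pool_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (m == MAP_FAILED) return NULL;
            pool = m;
        }
        n = (n + 15) & ~(size_t)15;                /* keep 16-byte alignment */
        if (pool_off + n > pool_size)
            return NULL;                           /* real malloc: mmap more */
        void *p = pool + pool_off;
        pool_off += n;
        return p;                                  /* no syscall on this path */
    }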
Generally, there are 3 things you can do with memory, and it generally works the same under Linux and Windows (and every other modern OS), although the API functions used are different, and there are a few more minor differences.
You can reserve it. This does more or less nothing, apart from logically dividing up your address space (only your process cares about that).
Next, you can commit it. This again doesn't do much, but it somewhat influences other processes. The system has a total limit on how much memory it can commit for all processes (physical RAM plus page file size), and it keeps track of that. This means that memory you commit counts against the same limit another process could commit against. Otherwise, again, not much happens.
Last, you can access memory. This, finally, has a noticeable effect. Upon first access to a page, a fault occurs (because the page does not exist at all!), and the operating system either fetches some data from a file (if the page belongs to a mapping) or clears some page (possibly after first saving it to disk). The OS then adjusts the structures in the virtual memory system so you see this physical page of RAM at the address you accessed.
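On Windows the three stages map directly onto VirtualAlloc flags; a minimal sketch:

    /* Reserve address space, commit it (charges the commit limit),
     * then touch it (physical pages materialize via page faults). */
    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        SIZE_T size = 64 * 1024;

        /* 1. Reserve: carve out address space, nothing backs it yet. */
        char *p = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
        if (!p) return 1;

        /* 2. Commit: counts against the commit limit; still no RAM used. */
        VirtualAlloc(p, size, MEM_COMMIT, PAGE_READWRITE);

        /* 3. Access: the first write faults, and the OS wires in a page. */
        p[0] = 42;
        printf("reserved, committed, touched at %p\n", (void *)p);

        VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }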
From your point of view, none of that is visible. It just works as if by magic.
It is possible to inspect processes for what areas in their address space are used, and it is possible (but kind of meaningless) to translate this to physical addresses. Note that the same program run at different times might store e.g. one particular variable at a different address. Under Windows, you can for example use the VMMap tool to inspect process memory allocation.
You can only use all RAM if you write your own operating system, since there is always a little memory which the OS reserves that user processes cannot use.
Otherwise you can in principle use [almost] all memory. However, whether or not you can directly use that much depends on whether your process is 32-bit or 64-bit. Computers nowadays typically have more RAM than you can address with 32 bits, so either you need to use address windowing extensions or your process must be 64-bit. Also, even given an amount of RAM that is in principle addressable using 32 bits, some address space factors (e.g. fragmentation, kernel reserve) may prevent you from directly using all memory.
How can a single common address space be given to all tasks? If all tasks share a common address space, can we avoid virtual-to-physical memory mapping?
There are a few modern (research) OSs that do this, like Singularity, and there are performance benefits, primarily because the system no longer needs address-space switches, and the file/symbol loader no longer needs to do address translation for global caches and kernel functions.
You do need to be a bit more specific about what you're looking for, tho'. You tagged your post OSX and Linux, both of which require virtual memory. When running on systems without an MMU (and thus no virtual memory), it is emulated, which I'm fairly certain you can't circumvent. I'm not an expert by any means.
uClinux is an implementation of Linux that runs on processors that lack an MMU (such as ARM7), so by definition must have a single address space for all tasks.
So one answer to "how" is "use uClinux".
You tagged this VxWorks, and there is another answer: VxWorks supports a flat memory model. In fact, when I last used it, MMU protection was an (expensive) add-on. Many other RTOSs designed for microcontrollers similarly do not support an MMU, such as eCos and FreeRTOS.
Of the RTOSs that do support an MMU, QNX is probably amongst the most robust and mature, while still maintaining high performance.
I'm not sure why you would want to disable virtual memory mapping - it's a built-in function of the CPU, and pretty much essential when running an OS to properly isolate processes from each other.
Most operating systems allow you to disable virtual memory so that your memory capacity is limited by physical memory. However, a process's address space is still virtual, and virtual-to-physical mapping is still happening.
A way to get what you want is to run an operating system that executes in Real Mode, such as DOS or Windows 3.0, or write your own.
The advantages of virtual memory far outweigh the disadvantages. Why do you want to avoid it?
This is how some older operating systems worked, and how some modern operating systems that lack VM still work. It has many disadvantages for things like desktop and server applications, but it can be useful in an embedded and/or real-time context, or where you have minimal hardware.
VxWorks AE (Advanced Edition) deviates from the concept of a common address space for all tasks, so it can effectively be used both in systems with an MMU and in systems without one. The common address space for all tasks is called the flat memory model, and separate address spaces for different tasks is called the overlapped (or segmented) memory model. You should not confuse the memory model with the memory layout seen in object files, which divides data into a code segment, data segment, BSS, etc. The two are entirely different things :).
This related Stack Overflow question explains it better:
Difference between flat memory model and protected memory model?
I know that under Windows there are API functions like GlobalAlloc(), which allocate memory and return a handle; the handle can then be locked to obtain a pointer, and later unlocked again. While unlocked, the system can move the piece of memory around when it runs low on space, optimizing memory usage.
My question is: is there something similar under Linux, and if not, how does Linux optimize its memory usage?
Those Windows functions date from a time when all programs ran in the same address space in real mode. Linux, and modern versions of Windows, run programs in separate address spaces, so the OS can move them about in RAM by remapping which physical address a particular virtual address resolves to in the page tables. There is no need to burden the programmer with such low-level details.
Even on Windows, it's no longer necessary to use such functions except when interacting with a small number of old APIs. I believe Raymond Chen's blog and book have some discussions of the topic if you are interested in more detail. Eg here's part 4 of a series on the history of GlobalLock.
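For completeness, the pattern the question describes looks like this (still needed when handing memory to legacy APIs such as the clipboard, which expects an HGLOBAL):

    /* The legacy Win32 moveable-memory pattern. */
    #include <windows.h>
    #include <string.h>

    int main(void) {
        /* GMEM_MOVEABLE: the block may be relocated while unlocked. */
        HGLOBAL h = GlobalAlloc(GMEM_MOVEABLE, 64);
        if (!h) return 1;

        char *p = GlobalLock(h);     /* pin it and get a usable pointer */
        strcpy(p, "hello");
        GlobalUnlock(h);             /* the pointer is now invalid */

        GlobalFree(h);
        return 0;
    }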
I'm not sure what the Linux equivalent is, but AT&T UNIX has "scatter/gather" memory management functions in the memory manager of the core OS. In a virtual memory operating environment there are no absolute addresses, so applications have no equivalent function. The executable loader (which loads an executable file into memory, where it becomes a process) uses addresses from the memory manager, all tracked in virtual memory blocks maintained in its page table (which contains the physical memory addresses). The bottom line is that your application's physical memory layout is likely never linear nor directly accessible.