program life in terms of paged segmentation memory

I am confused about how segmentation and paging work together on x86 Linux machines. I would be glad if someone could clarify all the steps involved, from start to end.
x86 uses a paged segmentation technique for memory management.
Can anyone please explain what happens from the moment an executable ELF file is loaded from the hard disk into main memory until the process dies? When compiled, the executable has different sections in it (text, data, stack, heap, bss). How are these loaded, and how are they set up under paged segmentation?
I would like to know how the page tables get set up for the loaded program, how the GDT gets set up, and how the registers are loaded. Also, why is it said that logical addresses (the ones processed by the segmentation unit of the MMU) are 48 bits wide (16-bit segment selector + 32-bit offset) on a 32-bit machine? How are the other 16 bits stored? Anything accessed from RAM must be 32 bits (4 bytes), so how are the remaining 16 bits accessed in order to be loaded into the segment registers?
Thanks in advance. The question covers a lot of ground, but I want to understand the entire life cycle of an executable. I would be glad if someone answers and starts a discussion on this.

Unix has traditionally implemented protection via paging. The 286 and later provide protected-mode segmentation, and the 386 and later add paging. Everyone uses paging; few make any real use of segmentation.
In x86, every memory operand has an implicit segment (so the address is really a 16-bit selector + 32-bit offset), depending on the base register used. If you access [ESP + 8] (or anything based on EBP), the implied segment register is SS; if you access [ESI] or [EDI + 4], it is DS; ES is implied only for the destination operand of string instructions such as MOVS and STOS. You can override the default with a segment override prefix.
Linux, and virtually every modern x86 OS, uses a flat memory model (or something similar). Under a flat memory model each segment provides access to the whole address space, with a base of 0 and a limit of 4 GiB, so you don't have to worry about the complications segmentation brings. Basically there are four segments: kernelspace code (RX), kernelspace data (RW), userspace code (RX), userspace data (RW).
An ELF file consists of headers that point to "program segments" and "sections". Sections are used for linking; program segments are used for loading. Program segments are mapped into memory via mmap(), which sets up page-table entries with the appropriate permissions.
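As a rough illustration of how a loader could derive the mmap() protection from a program header, here is a minimal sketch. The names SKETCH_PF_* and prot_from_pflags() are made up for this example; only the numeric values (taken from PF_X, PF_W and PF_R in the ELF specification) and the PROT_* constants from <sys/mman.h> are real.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>   /* PROT_NONE, PROT_READ, PROT_WRITE, PROT_EXEC */

/* Permission bits of an ELF program header's p_flags field.
 * The names are local to this sketch; the values match PF_X, PF_W
 * and PF_R from the ELF specification. */
enum { SKETCH_PF_X = 0x1, SKETCH_PF_W = 0x2, SKETCH_PF_R = 0x4 };

/* Translate p_flags into the protection argument that mmap() expects. */
static int prot_from_pflags(uint32_t p_flags)
{
    int prot = PROT_NONE;

    if (p_flags & SKETCH_PF_R)
        prot |= PROT_READ;
    if (p_flags & SKETCH_PF_W)
        prot |= PROT_WRITE;
    if (p_flags & SKETCH_PF_X)
        prot |= PROT_EXEC;

    /* Only the three permission bits may be set in the result. */
    assert((prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC)) == 0);
    return prot;
}
```

A typical text segment (R+X) would then be mapped with PROT_READ | PROT_EXEC, and a data segment (R+W) with PROT_READ | PROT_WRITE.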
Now, older x86 CPUs' paging mechanism only provided RW access control (read permission implies execute permission), while segmentation provided RWX access control. The end permission takes into account both segmentation and paging (e.g: RW (data segment) + R (read only page) = R (read only), while RX (code segment) + R (read only page) = RX (read and execute)).
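The way the two permission sets combine can be modelled as a bitwise AND. This is only a toy model of the rule described above, not how the hardware is implemented; the PERM_* names and both helper functions are invented for the example.

```c
#include <assert.h>

enum { PERM_X = 1, PERM_W = 2, PERM_R = 4 };

/* Legacy x86 paging has no execute bit: a present page is readable and
 * therefore executable; only write access is controlled separately. */
static unsigned page_perms(int writable)
{
    return PERM_R | PERM_X | (writable ? PERM_W : 0);
}

/* The effective permission is whatever both mechanisms allow. */
static unsigned effective_perms(unsigned segment_perms, unsigned page)
{
    unsigned eff = segment_perms & page;

    /* Paging can only take rights away from what the segment grants. */
    assert((eff & ~segment_perms) == 0);
    return eff;
}
```

With a read-only page, a data segment (PERM_R | PERM_W) ends up with PERM_R, while a code segment (PERM_R | PERM_X) keeps PERM_R | PERM_X, matching the two examples in the paragraph above.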
So there are some patches that provide execution prevention via segmentation: e.g. OpenWall provided a non-executable stack by shrinking the code segment (the one with execute permission), and having special emulation in the page fault handler for anything that needed execution from a high memory address (e.g: GCC trampolines, self-modified code created on the stack to efficiently implement nested functions).

There's no such thing as paged segmentation, not in the official documentation at least. There are two different mechanisms working together and more or less independently of each other:
1. Translation of a logical address of the form 16-bit segment selector value:16/32/64-bit segment offset value (that is, a pair of two numbers) into a 32/64-bit virtual address.
2. Translation of the virtual address into a 32/64-bit physical address.
Logical addresses are what your applications operate with directly. The two-step translation above then turns them into what the RAM understands: physical addresses.
In the first step the GDT (or it can be LDT, depends on the selector value) is indexed by the selector to find the relevant segment's base address and size. The virtual address will be the sum of the segment base address and the offset. The segment size and other things in segment descriptors are needed to provide protection.
In the second step the page tables are indexed by different parts of the virtual address and the last indexed table in the hierarchy gives the final, physical address that goes out on the address bus for the RAM to see. Just like with segment descriptors, page table entries contain not only addresses but also protection control bits.
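The two steps can be sketched in C for the simplest case, 32-bit protected mode with 4 KiB pages and two-level (non-PAE) page tables. All type and function names here are hypothetical, the descriptor is reduced to base and limit, and the second function only shows how the virtual address is split into indices, not the table lookups themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A segment descriptor reduced to what step 1 needs.
 * limit is the last valid offset inside the segment. */
struct seg_desc {
    uint32_t base;
    uint32_t limit;
};

/* Step 1: descriptor + offset -> virtual (linear) address.
 * Returns false where the CPU would raise a protection fault. */
static bool seg_translate(const struct seg_desc *d, uint32_t offset,
                          uint32_t *virt)
{
    if (offset > d->limit)
        return false;
    *virt = d->base + offset;   /* wraps modulo 2^32 */
    return true;
}

/* Step 2 (index computation only): 10 + 10 + 12 bit split. */
struct va_parts {
    uint32_t dir_index;    /* index into the page directory */
    uint32_t table_index;  /* index into the selected page table */
    uint32_t page_offset;  /* byte offset inside the 4 KiB page */
};

static struct va_parts split_virtual(uint32_t virt)
{
    struct va_parts p = { virt >> 22, (virt >> 12) & 0x3ff, virt & 0xfff };

    assert(p.dir_index < 1024 && p.table_index < 1024);
    return p;
}
```

With a flat descriptor (base 0, limit 0xFFFFFFFF), seg_translate() returns the offset unchanged, which is the situation on Linux.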
That's about it on the mechanisms.
Now, in many x86 OSes the segment selectors that are used for applications are fixed, they are the same in all of them, they never change and they point to segment descriptors that have base addresses equal to 0 and sizes equal to the possible maximum (e.g. 4GB in non-64-bit modes). Such a GDT setup effectively means that the first step does no useful work and the offset part of the logical address translates into numerically equal virtual address.
This makes the segment selector values practically useless. They still have to be loaded into the CPU's segment registers (in non-64-bit modes into at least CS, SS, DS and ES), but beyond that point they can be forgotten about.
This all (except Linux-related details and the ELF format) is explained in or directly follows from Intel's and AMD's x86 CPU manuals. You'll find many more details there.

Perhaps read the Assembly HOWTO. When a Linux process starts to execute an ELF executable using the execve system call, the kernel is essentially (sort of) mmap-ing some segments (and initializing registers and a tiny part of the stack). Read also the SVR4 x86 ABI supplement and its x86-64 variant. Don't forget that a Linux process only sees the memory mappings of its own address space and only cares about virtual memory.
There are many good books on operating system kernels, notably by A. Tanenbaum and by M. Bach, and some on the Linux kernel.
NB: segment registers are almost unused on Linux.

Related

Which direction does memory-mapped segment of a process's virtual address space grow by default?

I'm currently going through the code that loads an ELF from disk to memory, which corresponds to the function load_elf_binary() in Linux kernel.
This function sets up the addresses of the different segments (e.g. text, data, bss, heap, stack, mmap'ed area). While tracing the code, I noticed one function, setup_new_exec(), which is defined in fs/exec.c. It calls arch_pick_mmap_layout(). Note that I am not targeting a specific architecture such as x86, so I am referring to the generic definition of that function.
Below is part of the code:
if (mmap_is_legacy(rlim_stack)) {
    mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
    mm->get_unmapped_area = arch_get_unmapped_area;
} else {
    mm->mmap_base = mmap_base(random_factor, rlim_stack);
    mm->get_unmapped_area = arch_get_unmapped_area_topdown;
}
Based on the code, I know there are two ways of obtaining unmapped areas: bottom-up (legacy) and top-down. Both are discussed in this LWN article as well.
To tell them apart, the kernel uses mmap_is_legacy(), which returns sysctl_legacy_va_layout; that variable is initialized to 0 by default.
Does that mean that, by default, the memory-mapped region of a process grows from top to bottom (from high addresses to low addresses, that is, from the stack towards the heap)?

Your general assumption that "by default, the memory mapped region of a process grows from top to bottom" is correct.
The default and legacy layouts nowadays should look like this:
      DEFAULT                 LEGACY
0xffffffffffffffff      0xffffffffffffffff
       stack                   stack
         🡓                       🡓
        mmap                    ...
         🡓                       🡑
        ...                     heap
        ...                     ELF
         🡑                      ...
        heap                     🡑
        ELF                     mmap
        ...                     ...
0x0000000000000000      0x0000000000000000
[...] the legacy layout nowadays has the mmap segment being the low address. Is there code that proves that? Besides, does the legacy memory layout nowadays start from virtual address 0?
Sure, you can see this exactly in the code you linked, in the generic arch_pick_mmap_layout() implementation, which chooses a low mmap_base for the legacy layout. The calculation is TASK_UNMAPPED_BASE + random_factor (the random_factor comes from ASLR, see /proc/sys/kernel/randomize_va_space). Note that some architectures (namely x86, PA-RISC, PowerPC, S390, Sparc) override that function and provide their own, but the calculations that are done are pretty much the same (you can check the source code).
That TASK_UNMAPPED_BASE represents the lower boundary for the "mmap" virtual memory area, and it varies per architecture. It should not be defined as low as 0 (zero) though. It is usually defined in terms of TASK_SIZE.
Some examples:
TASK_SIZE / 3 = 0x2aaaaaaab000 on x86-64
TASK_SIZE / 3 = 0x40000000 on x86 32bit with default VMSPLIT_3G config
TASK_SIZE / 4 = 0x400000000000 on ARM64
CONFIG_PAGE_OFFSET / 3 = 0x40000000 on ARM 32bit with default VMSPLIT_3G config
TASK_SIZE / 3 = 0x5555555000 on MIPS 64bit with 40 VA bits
TASK_SIZE / 8 * 3 = 0x30000000 on PowerPC 8xx (32bit)
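The first two figures in this list can be reproduced with a few lines of C. This is a sketch under assumptions: a 4 KiB page size, TASK_SIZE = 0x7ffffffff000 for x86-64 with 4-level paging and 0xC0000000 for 32-bit x86 with a 3G/1G split, and rounding up to a page boundary as the kernel's PAGE_ALIGN() does. The helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL   /* assumption: 4 KiB pages */

/* Round an address up to the next page boundary. */
static uint64_t page_align_up(uint64_t addr)
{
    return (addr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Lower boundary of the "mmap" area: a page-aligned fraction of TASK_SIZE. */
static uint64_t unmapped_base(uint64_t task_size, uint64_t divisor)
{
    assert(divisor != 0);
    return page_align_up(task_size / divisor);
}
```

For example, unmapped_base(0x7ffffffff000ULL, 3) yields 0x2aaaaaaab000 and unmapped_base(0xC0000000ULL, 3) yields 0x40000000.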
The lowest address mappable by userspace is actually given by /proc/sys/vm/mmap_min_addr, which is non-zero by default (for example 0x10000 on x86). Such low addresses must be explicitly requested through a hint to mmap; the kernel does not hand them out on its own as a result of mmap(0, ...).
So we have established that for the legacy layout the "mmap" area starts at some low address, and we already know that the stack always starts at the highest address.
As for the ELF itself, the file is merely mapped into memory by the kernel according to its type and its program headers, usually contiguously with no holes, and the calculations are the same regardless of the default/legacy mmap layout. You will see multiple segments mapped with different permissions as specified in the ELF program headers (see the output of readelf -l), and those segments will contain different sections, such as .text, .rodata, .bss, and so on (see the output of readelf -S).
For ELF executables (i.e. e_type = ET_EXEC, see man 5 elf) the base virtual address is chosen by the ELF itself: it is fixed and determined at build time, and such an ELF will not work if loaded at a different address.
For ELF Shared Objects (i.e. e_type = ET_DYN), which nowadays are the norm, the base virtual address is chosen by the kernel itself and is defined by ELF_ET_DYN_BASE (adjusted if ASLR is enabled). This other answer of mine covers x86. This value is above TASK_UNMAPPED_BASE, so you will see the ELF above the "mmap" area (higher addresses) in the legacy layout, and below it (lower addresses) in the default layout.
The "heap" area (a.k.a. the program break) by definition will start right after the ELF growing towards high addresses regardless.
[Figures: annotated /proc/[pid]/maps output for the default layout and for the legacy layout on an x86-64 machine; low addresses are at the top.]
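To inspect the layout on your own machine, the lines of /proc/[pid]/maps can be parsed with a small helper. The sketch below only extracts the address range and the permission string from one line; struct map_line and parse_maps_line() are hypothetical names, and reading the file itself (e.g. with fopen() and fgets()) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* The leading fields of one /proc/[pid]/maps line. */
struct map_line {
    unsigned long long start;  /* first address of the mapping */
    unsigned long long end;    /* first address past the mapping */
    char perms[5];             /* e.g. "r-xp", NUL-terminated */
};

/* Parse "start-end perms ..."; returns 1 on success, 0 otherwise. */
static int parse_maps_line(const char *line, struct map_line *out)
{
    assert(line != NULL && out != NULL);

    if (sscanf(line, "%llx-%llx %4s", &out->start, &out->end,
               out->perms) != 3)
        return 0;
    return out->start < out->end;
}
```

Feeding it every line of /proc/self/maps and printing start, end and perms shows where the stack, the mmap area, the heap and the ELF mappings sit relative to each other.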

How exactly do kernel virtual addresses get translated to physical RAM?

On the surface, this appears to be a silly question. Some patience, please. :-)
I am structuring this question into two parts:
Part 1:
I fully understand that platform RAM is mapped into the kernel segment; especially on 64-bit systems this works well. So each kernel virtual address is indeed just an offset from physical memory (DRAM).
Also, it's my understanding that since Linux is a modern virtual memory OS, (pretty much) all addresses are treated as virtual addresses and must "go" via hardware, the TLB/MMU, at runtime and get translated by the TLB/MMU via kernel paging tables. Again, this is easy to understand for user-mode processes.
HOWEVER, what about kernel virtual addresses? For efficiency, would it not be simpler to direct-map these (and an identity mapping is indeed set up from PAGE_OFFSET onwards)? But still, at runtime, the kernel virtual address must go via the TLB/MMU and get translated, right? Is this actually the case? Or is kernel virtual address translation just an offset calculation? (But how can that be, as we must go via the hardware TLB/MMU?) As a simple example, let's consider:
char *kptr = kmalloc(1024, GFP_KERNEL);
Now kptr is a kernel virtual address.
I understand that virt_to_phys() can perform the offset calculation and return the physical DRAM address.
But, here's the Actual Question: it can't be done in this manner via software - that would be pathetically slow! So, back to my earlier point: it would have to be translated via hardware (TLB/MMU).
Is this actually the case??
Part 2:
Okay, let's say this is the case and we do use paging in the kernel. We must then of course set up kernel paging tables; I understand they are rooted at swapper_pg_dir.
(I also understand that vmalloc(), unlike kmalloc(), is a special case: it's a pure virtual region that gets backed by physical frames only on page fault.)
If (in Part 1) we do conclude that kernel virtual address translation is done via kernel paging tables, then how exactly does the kernel paging table (swapper_pg_dir) get "attached" or "mapped" to a user-mode process? Should this happen in the context-switch code? How? Where?
E.g.:
On x86_64, two processes A and B are alive, on one CPU.
A is running, so its higher canonical addresses, 0xFFFF8000 00000000 through 0xFFFFFFFF FFFFFFFF, "map" to the kernel segment, and its lower canonical addresses, 0x0 through 0x00007FFF FFFFFFFF, map to its private userspace.
Now, if we context-switch A->B, process B's lower canonical region is unique, but it must "map" to the same kernel, of course!
How exactly does this happen? How do we "auto" refer to the kernel paging table when in kernel mode? Or is that a wrong statement?
Thanks for your patience, would really appreciate a well thought out answer!

First a bit of background.
This is an area where there is a lot of potential variation between
architectures, however the original poster has indicated he is mainly
interested in x86 and ARM, which share several characteristics:
no hardware segments or similar partitioning of the virtual address space (when used by Linux)
hardware page table walk
multiple page sizes
physically tagged caches (at least on modern ARMs)
So if we restrict ourselves to those systems it keeps things simpler.
Once the MMU is enabled, it is never normally turned off. So all CPU
addresses are virtual, and will be translated to physical addresses
using the MMU. The MMU will first look up the virtual address in the
TLB, and only if it doesn't find it in the TLB will it refer to the
page table - the TLB is a cache of the page table - and so we can
ignore the TLB for this discussion.
The page table
describes the entire virtual 32 or 64 bit address space, and includes
information like:
whether the virtual address is valid
which mode(s) the processor must be in for it to be valid
special attributes for things like memory mapped hardware registers
and the physical address to use
Linux divides the virtual address space into two: the lower portion is
used for user processes, and there is a different virtual to physical
mapping for each process. The upper portion is used for the kernel,
and the mapping is the same even when switching between different user
processes. This keeps things simple, as an address is unambiguously in
user or kernel space, the page table doesn't need to be changed when
entering or leaving the kernel, and the kernel can simply dereference
pointers into user space for the current user process. Typically on
32-bit processors the split is 3G
user/1G kernel, although this can vary. Pages for the kernel portion
of the address space will be marked as accessible only when the processor
is in kernel mode to prevent them being accessible to user processes.
The portion of the kernel address space which is linearly mapped to RAM
at a fixed offset (kernel logical addresses) will be mapped using big
pages when possible, which may allow the page table to be smaller but,
more importantly, reduces the number of TLB misses.
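For kernel logical addresses the relation between virtual and physical address is a constant offset, which is the arithmetic that helpers such as virt_to_phys() perform for the kernel's own bookkeeping; the CPU itself still resolves every access through the MMU and the page tables. A minimal sketch of that arithmetic, assuming a 32-bit system with a 3G/1G split (PAGE_OFFSET = 0xC0000000) and using made-up function names:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_OFFSET 0xC0000000u  /* assumption: 32-bit, 3G/1G split */

/* Kernel logical (virtual) address -> physical address. */
static uint32_t sketch_virt_to_phys(uint32_t kvaddr)
{
    assert(kvaddr >= SKETCH_PAGE_OFFSET);   /* must be a kernel address */
    return kvaddr - SKETCH_PAGE_OFFSET;
}

/* Physical address -> kernel logical (virtual) address. */
static uint32_t sketch_phys_to_virt(uint32_t paddr)
{
    assert(paddr <= UINT32_MAX - SKETCH_PAGE_OFFSET);  /* must fit */
    return paddr + SKETCH_PAGE_OFFSET;
}
```

Under these assumptions the kernel address 0xC0100000 corresponds to physical address 0x00100000. The subtraction never runs on the memory access path; it is only needed when the kernel has to hand a physical address to something else, for example a device.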
When the kernel starts it creates a single page table for itself
(swapper_pg_dir) which just describes the kernel portion of the
virtual address space and with no mappings for the user portion of the
address space. Then every time a user process is created a new page
table will be generated for that process, the portion which describes
kernel memory will be the same in each of these page tables. This could be
done by copying all of the relevant portion of swapper_pg_dir, but
because page tables are normally tree structures, the kernel is
frequently able to graft the portion of the tree which describes the
kernel address space from swapper_pg_dir into the page tables for each
user process by just copying a few entries in the upper layer of the
page table structure. As well as being more efficient in memory (and possibly
cache) usage, it makes it easier to keep the mappings consistent. This
is one of the reasons why the split between kernel and user virtual
address spaces can only occur at certain addresses.
To see how this is done for a particular architecture look at the
implementation of pgd_alloc(). For example ARM
(arch/arm/mm/pgd.c) uses:
pgd_t *pgd_alloc(struct mm_struct *mm)
{
    ...
    init_pgd = pgd_offset_k(0);
    memcpy(new_pgd + USER_PTRS_PER_PGD, init_pgd + USER_PTRS_PER_PGD,
           (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
    ...
}
or
x86 (arch/x86/mm/pgtable.c) pgd_alloc() calls pgd_ctor():
static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
    /* If the pgd points to a shared pagetable level (either the
       ptes in non-PAE, or shared PMD in PAE), then just copy the
       references from swapper_pg_dir. */
    ...
    clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
                    swapper_pg_dir + KERNEL_PGD_BOUNDARY,
                    KERNEL_PGD_PTRS);
    ...
}
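The grafting done by both kernel snippets can be imitated in user space with plain arrays. The sketch below assumes the 32-bit non-PAE x86 geometry (1024 top-level entries, kernel half starting at entry 768, i.e. 0xC0000000 >> 22); the sk_* names are hypothetical and the entries are just integers, not real page-table pointers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_PTRS_PER_PGD    1024  /* assumption: 32-bit non-PAE x86 */
#define SK_KERNEL_BOUNDARY  768  /* 0xC0000000 >> 22 */

typedef uint32_t sk_pgd_t;

/* Initialise a new process's top-level table: the user part starts
 * empty, the kernel part is copied from the kernel's master table so
 * that both tables point at the same lower-level kernel page tables. */
static void sk_pgd_init(sk_pgd_t *new_pgd, const sk_pgd_t *kernel_pgd)
{
    assert(new_pgd != kernel_pgd);

    memset(new_pgd, 0, SK_KERNEL_BOUNDARY * sizeof(sk_pgd_t));
    memcpy(new_pgd + SK_KERNEL_BOUNDARY,
           kernel_pgd + SK_KERNEL_BOUNDARY,
           (SK_PTRS_PER_PGD - SK_KERNEL_BOUNDARY) * sizeof(sk_pgd_t));
}
```

Because only the 256 top-level entries are copied, every process ends up sharing the kernel's lower-level tables rather than owning a private copy of them.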
So, back to the original questions:
Part 1: Are kernel virtual addresses really translated by the TLB/MMU?
Yes.
Part 2: How is swapper_pg_dir "attached" to a user mode process?
All page tables (whether swapper_pg_dir or those for user processes)
have the same mappings for the portion used for kernel virtual
addresses. So as the kernel context switches between user processes,
changing the current page table, the mappings for the kernel portion
of the address space remain the same.

The kernel address space is mapped into a section of each process's address space, for example above address 0xC0000000 with a 3:1 split. If user code tries to access this address range, it generates a page fault, so the range is guarded by the kernel.
The kernel address space is divided into two parts, the logical address space and the virtual address space; the boundary is defined by the constant VMALLOC_START. The CPU uses the MMU all the time, in user space and in kernel space (it can't be switched on and off).
The kernel virtual address space is mapped the same way as user space. The logical address space is contiguous and simple to translate to physical addresses, so it can be mapped on demand using the MMU fault exception: the kernel tries to access an address, the MMU generates a fault, the fault handler maps the page using the macros __pa and __va and sets the CPU's pc register back to the instruction that faulted, and execution continues normally. This process is actually platform dependent, and on some hardware architectures the logical address space is mapped the same way as user space (because the kernel doesn't use a lot of memory).

Linux uses Paging or Segmentation or Both? [duplicate]

I'm reading "Understanding Linux Kernel". This is the snippet that explains how Linux uses Segmentation which I didn't understand.
Segmentation has been included in 80x86 microprocessors to encourage programmers to split their applications into logically related entities, such as subroutines or global and local data areas. However, Linux uses segmentation in a very limited way. In fact, segmentation and paging are somewhat redundant, because both can be used to separate the physical address spaces of processes: segmentation can assign a different linear address space to each process, while paging can map the same linear address space into different physical address spaces. Linux prefers paging to segmentation for the following reasons:
Memory management is simpler when all processes use the same segment register values, that is, when they share the same set of linear addresses.
One of the design objectives of Linux is portability to a wide range of architectures; RISC architectures in particular have limited support for segmentation.
All Linux processes running in User Mode use the same pair of segments to address instructions and data. These segments are called user code segment and user data segment, respectively. Similarly, all Linux processes running in Kernel Mode use the same pair of segments to address instructions and data: they are called kernel code segment and kernel data segment, respectively. Table 2-3 shows the values of the Segment Descriptor fields for these four crucial segments.
I'm unable to understand the first and last paragraphs.

The 80x86 family of CPUs generates an address by adding a base taken from a CPU register called a segment register (in protected mode, from the descriptor that register selects) to an offset such as the program counter. Thus by changing the segment register contents you can change the physical addresses that the program accesses. Paging does something similar by mapping the same virtual address to different real addresses. Linux uses the latter: the segment registers of Linux processes always have the same, unchanging contents.

Segmentation and Paging are not at all redundant. The Linux OS fully incorporates demand paging, but it does not use memory segmentation. This gives all tasks a flat, linear, virtual address space of 32/64 bits.
Paging adds on another layer of abstraction to the memory address translation. With paging, linear memory addresses are mapped to pages of memory, instead of being translated directly to physical memory. Since pages can be swapped in and out of physical RAM, paging allows more memory to be allocated than what is physically available. Only pages that are being actively used need to be mapped into physical memory.
An alternative to page swapping is segment swapping, but it is generally much less efficient given that segments are usually larger than pages.
Segmentation of memory is a method of allocating multiple chunks of memory (per task) for different purposes and allowing those chunks to be protected from each other. In Linux a task's code, data, and stack sections are all mapped to a single segment of memory.
The 32-bit processors do not have a mode bit for disabling
segmentation, but the same effect can be achieved by mapping the
stack, code, and data spaces to the same range of linear addresses.
The 32-bit offsets used by 32-bit processor instructions can cover a
four-gigabyte linear address space.
Additionally, the Intel documentation states:
A flat model without paging minimally requires a GDT with one code and
one data segment descriptor. A null descriptor in the first GDT entry
is also required. A flat model with paging may provide code and data
descriptors for supervisor mode and another set of code and data
descriptors for user mode
This is the reason for having one pair of CS/DS for kernel privilege execution (ring 0) and one pair of CS/DS for user privilege execution (ring 3).
Summary: Segmentation provides a means to isolate and protect sections of memory. Paging provides a means to allocate more memory than what is physically available.

Windows uses the FS segment for thread-local storage.
Therefore, Wine has to use it, and the Linux kernel needs to support it.

Modern operating systems (i.e. Linux, other Unixen, Windows NT, etc.) do not use the segmentation facility provided by the x86 processor. Instead, they use a flat 32-bit memory model. Each user-mode process has its own 32-bit virtual address space.
(Naturally the widths are expanded to 64 bits on x86_64 systems)

Intel first added protected-mode segmentation on the 80286, and then paging on the 80386. Unix-like OSes typically use paging for virtual memory.
Anyway, since paging on x86 didn't support execute permissions until recently, Openwall Linux used segmentation to provide non-executable stack regions, i.e. it set the code segment limit to a lower value than the other segments' limits, and did some emulation to support trampolines on the stack.

x86 - segmentation in protected mode serves what purpose?

I read about x86 memory segmentation and I think that I'm missing something.
The linear (virtual) address is built by taking the 32-bit base address from the GDT entry and the 32-bit offset, and summing them to get a 32-bit virtual address.
Now, as I see it, the 32 offset bits can span the whole VA space, so there isn't really a need for the 32-bit base address. So I conclude that the base address doesn't really play a role in the translation process, which brings me to the point that memory protection using segmentation (in x86 protected mode) is useless, because we can reach the VA of segments with ring 0 privileges through the offset alone (e.g. jump 0x08000001, to a kernel VA, when our segment has ring 3 privilege).
So is all the memory protection we get based on paging?

The segment selector and the segment descriptor hold data about the boundary of a memory segment.
The descriptor contains not only the boundary but also the access type:
a privilege level from 0 to 3, where a lower number means more privilege, as well as read/write/execute information.
So segments with different privilege levels carry different access rights,
and this is part of the fundamental mechanism of protected mode.
Legacy segmentation merely keeps the segment areas from overlapping; otherwise the code segment could be contaminated by the data segment or stack segment memory of the application (or user).
Protected-mode segmentation adds more definite protection through the segment selector and segment descriptor.
After the boot procedure, the system enters user privilege mode (3), and after that you can't access kernel privilege mode (0) unless you use a rootkit, or perhaps there is another way for skillful hackers. :)

Your observation that the 32-bit offset can span the entire VA space is correct. But segment descriptors also include a limit, so any accesses beyond that limit using that segment will cause a #GP (general protection fault). Also, you can't just use a ring-0 segment in ring-3 code; that would defeat the purpose of ring levels in the first place.

Segmentation registers use

I am trying to understand how memory management works at a low level and have a couple of questions.
1) A book about assembly language by Kip R. Irvine says that in real mode the first three segment registers are loaded with the base addresses of the code, data, and stack segments when the program starts. This is a bit ambiguous to me. Are these values specified manually, or does the assembler generate instructions to write the values into the registers? If it happens automatically, how does it find out the size of these segments?
2) I know that Linux uses a flat linear model, i.e. it uses segmentation in a very limited way. Also, according to "Understanding the Linux Kernel" by Daniel P. Bovet and Marco Cesati, there are four main segments in the GDT: user data, user code, kernel data and kernel code. All four segments have the same size and base address. I do not understand why four of them are needed if they differ only in type and access rights (they all produce the same linear address, right?). Why not use just one of them and write its descriptor to all segment registers?
3) How do operating systems that do not use segmentation divide programs into logical segments? For example, how do they differentiate stack from code without segment descriptors? I read that paging can be used to handle such things, but I don't understand how.

You must have read some really old books, because nobody programs for real mode anymore ;-) In real mode, you get the physical address of a memory access with physical address = segment register * 0x10 + offset, the offset being a value held in one of the general-purpose registers. Because these registers are 16 bits wide, a segment is 64 KB long and there is nothing you can do about its size, simply because there is no size attribute! With the * 0x10 multiplication, 1 MB of memory becomes available, but there are overlapping combinations depending on what you put in the segment register and the address register. I haven't compiled any code for real mode, but I think it's up to the OS to set up the segment registers during binary loading, just like a loader would allocate some pages when loading an ELF binary. However, I have compiled bare-metal kernel code, and I had to set up these registers myself.
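The real-mode formula, and the overlap it produces, can be checked with a short C function. The function name is made up for this sketch; the arithmetic is the segment * 0x10 + offset rule quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode address formation: segment * 0x10 + offset. */
static uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    uint32_t phys = ((uint32_t)segment << 4) + offset;

    /* The largest value two 16-bit inputs can produce is
     * 0xFFFF0 + 0xFFFF, slightly above 1 MB. */
    assert(phys <= 0x10FFEFu);
    return phys;
}
```

For instance, 0x1234:0x0010 and 0x1235:0x0000 both give 0x12350, which is the kind of overlapping combination the answer mentions.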
Four segments are mandatory in the flat model because of architecture constraints. In protected mode the segment registers no longer contain the segment base address, but a segment selector, which is basically an offset into the GDT. Depending on the value of the segment selector, the CPU will be at a given level of privilege: this is the CPL (Current Privilege Level). The segment selector points to a segment descriptor which has a DPL (Descriptor Privilege Level), which eventually becomes the CPL if the segment register is loaded with this selector (at least this is true for the code-segment selector). Therefore you need at least a pair of segment selectors to differentiate the kernel from userland. Moreover, segments are either code segments or data segments, so you eventually end up with four segment descriptors in the GDT.
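A 16-bit selector packs three fields: the requested privilege level in bits 0-1, a table indicator (GDT or LDT) in bit 2, and the descriptor index in bits 3-15. A minimal decoding sketch follows; the struct and function names are hypothetical, and the values 0x73 and 0x60 mentioned afterwards are only example inputs.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields packed into a 16-bit segment selector. */
struct selector_fields {
    unsigned index;         /* descriptor index within the table */
    unsigned table_is_ldt;  /* 0 = GDT, 1 = LDT */
    unsigned rpl;           /* requested privilege level, 0..3 */
};

static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;

    f.rpl = sel & 0x3;
    f.table_is_ldt = (sel >> 2) & 0x1;
    f.index = sel >> 3;

    assert(f.index < 8192);  /* 13 index bits */
    return f;
}
```

Decoding 0x73 gives index 14, GDT, privilege level 3 (a user-level selector), while 0x60 gives index 12, GDT, privilege level 0 (a kernel-level selector). Since each descriptor is 8 bytes, index * 8 is the byte offset into the table, which is why the selector is "basically an offset into the GDT".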
I don't have any example of a serious OS which makes any use of segmentation, simply because segmentation is still present only for backward compatibility. Using the flat model approach is nothing but a means to get rid of it. Anyway, you're right, paging is way more efficient and versatile, and available on almost all architectures (the concepts at least). I can't explain paging internals here, but all the information you need is in the excellent Intel manual: Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1.

Expanding on Benoit's answer to question 3...
The division of programs into logical parts such as code, constant data, modifiable data and stack is done by different agents at different points in time.
First, your compiler (and linker) creates executable files where this division is specified. If you look at a number of executable file formats (PE, ELF, etc), you'll see that they support some kind of sections or segments or whatever you want to call it. Besides addresses and sizes and locations within the file, those sections bear attributes telling the OS the purpose of these sections, e.g. this section contains code (and here's the entry point), this - initialized constant data, that - uninitialized data (typically not taking space in the file), here's something about the stack, over there is the list of dependencies (e.g. DLLs), etc.
Next, when the OS starts executing the program, it parses the file to see how much memory the program needs, where and what memory protection is needed for every section. The latter is commonly done via page tables. The code pages are marked as executable and read-only, the constant data pages are marked as not executable and read-only, other data pages (including those of the stack) are marked as not executable and read-write. This is how it ought to be normally.
Often times programs need read-write and, at the same time, executable regions for dynamically generated code or just to be able to modify the existing code. The combined RWX access can be either specified in the executable file or requested at run time.
There can be other special pages such as guard pages for dynamic stack expansion, they're placed next to the stack pages. For example, your program starts with enough pages allocated for a 64KB stack and then when the program tries to access beyond that point, the OS intercepts access to those guard pages, allocates more pages for the stack (up to the maximum supported size) and moves the guard pages further. These pages don't need to be specified in the executable file, the OS can handle them on its own. The file should only specify the stack size(s) and perhaps the location.
If there's no hardware or code in the OS to distinguish code memory from data memory or to enforce memory access rights, the division is very formal. 16-bit real-mode DOS programs (COM and EXE) didn't have code, data and stack segments marked in some special way. COM programs had everything in one common 64KB segment and they started with IP=0x100 and SP=0xFFxx and the order of code and data could be arbitrary inside, they could intertwine practically freely. DOS EXE files only specified the starting CS:IP and SS:SP locations and beyond that the code, data and stack segments were indistinguishable to DOS. All it needed to do was load the file, perform relocation (for EXEs only), set up the PSP (Program Segment Prefix, containing the command line parameter and some other control info), load SS:SP and CS:IP. It could not protect memory because memory protection isn't available in the real address mode, and so the 16-bit DOS executable formats were very simple.

Wikipedia is your friend in this case. http://en.wikipedia.org/wiki/Memory_segmentation and http://en.wikipedia.org/wiki/X86_memory_segmentation should be good starting points.
I'm sure there are others here who can personally provide in-depth explanations, though.
