PTE structure in the linux kernel


I have been looking through the Linux source code for a structure or union that corresponds to the PTE on an x86 system with PAE disabled. So far I've found only the following in arch/x86/include/asm/page_32.h:
typedef union {
pteval_t pte;
pteval_t pte_low;
} pte_t;
I'm a bit confused right now since I have the Intel Reference Manual Vol 3A open in front of me and nothing in that union corresponds to the dozen odd fields present in the PTE as the manual explains.
This might be a trivial question but for me it has become more like a stumbling block in the process of understanding memory management in the linux kernel.
EDIT: I have the 2.6.29 source with me

The pteval_t just treats the page table entry as an opaque blob - on the architecture you're looking at, it's just a 32 bit unsigned value.
The fields within the PTE are accessed using bitwise operators and masks - in the source I have handy (Linux 2.6.24), these are defined in include/asm-x86/pgtable_32.h. The fields you see in the Intel Reference Manual (most of which are single-bit flags) are defined here - for example:
#define _PAGE_PRESENT 0x001
#define _PAGE_RW 0x002
#define _PAGE_USER 0x004
#define _PAGE_PWT 0x008
#define _PAGE_PCD 0x010
#define _PAGE_ACCESSED 0x020
#define _PAGE_DIRTY 0x040
#define _PAGE_PSE 0x080 /* 4 MB (or 2MB) page, Pentium+, if present.. */
#define _PAGE_GLOBAL 0x100 /* Global TLB entry PPro+ */
#define _PAGE_UNUSED1 0x200 /* available for programmer */
#define _PAGE_UNUSED2 0x400
#define _PAGE_UNUSED3 0x800
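As a rough illustration of the mask-based access described above, here is a minimal sketch in plain user-space C. The flag values are copied from the listing; every name with a demo_ or DEMO_ prefix is hypothetical and is not the kernel's own accessor. It assumes a non-PAE entry with 4 KiB pages, where bits 12 to 31 hold the page frame address.

```c
#include <stdint.h>

/* Flag values copied from the listing above (non-PAE x86, 32-bit entries). */
#define DEMO_PAGE_PRESENT 0x001u
#define DEMO_PAGE_RW      0x002u
#define DEMO_PAGE_DIRTY   0x040u

/* Assumption: 4 KiB pages, so bits 12..31 hold the page frame address. */
#define DEMO_PTE_FRAME_MASK 0xfffff000u

/* Hypothetical stand-in for pteval_t: an opaque 32-bit value. */
typedef uint32_t demo_pteval_t;

static inline int demo_pte_present(demo_pteval_t pte)
{
    return (pte & DEMO_PAGE_PRESENT) != 0;
}

static inline int demo_pte_writable(demo_pteval_t pte)
{
    return (pte & DEMO_PAGE_RW) != 0;
}

static inline int demo_pte_dirty(demo_pteval_t pte)
{
    return (pte & DEMO_PAGE_DIRTY) != 0;
}

/* Clear the dirty flag and leave every other bit untouched. */
static inline demo_pteval_t demo_pte_mkclean(demo_pteval_t pte)
{
    return pte & ~DEMO_PAGE_DIRTY;
}

/* Strip the flag bits to recover the physical frame address. */
static inline uint32_t demo_pte_frame_addr(demo_pteval_t pte)
{
    return pte & DEMO_PTE_FRAME_MASK;
}
```

For an entry such as 0x12345067, the helpers report present, writable and dirty, and the frame address comes out as 0x12345000.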

I would recommend buying Understanding the Linux Kernel from O'REILLY, as well as Linux Device Drivers. And subscribing to LWN.net; though you can get a pretty good start from their kernel index page even without a subscription.
For the memory management, look on the index page for the "Memory management" section... and the "Large-memory systems" section. The latter has a few articles that talk about the move to four-level page tables that should be helpful in understanding this area of the code.

Related

Cannot Make Code Segment Execute-Only (Not Readable)

I'm trying to make the Code Segment Execute-Only (Not Readable).
But I FAILED after I tried everything the Manual told me to. Here is what I did to make the code segment unreadable.
>uname -a
Linux Emmet-VM 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:18:00 UTC 2015 i686 i686 i686 GNU/Linux
>lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty
First, I've found this in "Intel(R)64 and IA-32 Architectures Software Developer's Manual(Combined Volumes 1,2A,2B,2C,2D,3A,3B,3C and 3D)":
Set read-enable bit to enable read and Segment Types.(Sorry, I'm still not allowed to embed pictures in my posts, so links instead)
So, I guess if I change %CS, and let it point to a Segment Descriptor which has read-enable bit set as 0, I should make the Code Segment not readable.
Then, I use the code below to insert a new Segment into LDT.entry[2], and I do set the code segment type to 8, aka 1000B, which means "Execute-Only" according to "Segment Types" link posted above:
typedef struct user_desc UserDesc;
UserDesc *seg = (UserDesc*)malloc(sizeof(UserDesc));
seg->entry_number = 0x2;
seg->base_addr = 0x00000000;
seg->limit = 0xffffffff;
seg->seg_32bit = 0x1;
seg->contents = 0x02;
seg->read_exec_only = 0x1;
seg->limit_in_pages = 0x1;
seg->seg_not_present = 0x0;
seg->useable = 0x0;
int ret = modify_ldt(1, (void*)seg, sizeof(UserDesc));
After that, I change %CS to 0x17(00010111B, meaning the entry 2 in LDT) with ljmp.
asm("ljmp $0x17, $reload_cs\n"
"reload_cs:");
But, even with this, I still can read the byte code in code segment:
void foo() {printf("foo\n");}
void test(){
char* a = (char*)foo;
printf("0x%x\n", (unsigned int)a[0]);// This prints 0x55
}
If the code segment were unreadable, the code above should raise a segmentation fault. But it prints 0x55 successfully.
So, I wonder, is there any mistake I've made during my test?
Or is this just a mistake in Intel's Manual?

You are still accessing the code through DS when doing (unsigned int)a[0].
Write only segments don't exist (and if they did, it would be a bad idea to set DS write only).
If you did everything correctly mov eax, [cs:...] (NASM syntax) will fail (but mov eax, [ds:...] won't).
After a quick glance at the Intel Manual execute only pages should not exist (at least directly), so using mprotect with PROT_EXEC may be of limited use (the code would still be readable).
Worth a shot, though.
There are three ways around this.
None of which can be implemented without the aid of the OS though, so they are more theoretical than practical.
Protection keys
If the CPU supports them (See section 4.6.2 of the Intel manual 3), they introduce an asymmetry in how code and data are read.
Reading data is subject to the key protection.
Fetching however is not:
How a linear address’s protection key controls access to the address depends on the mode of a linear address:
A linear address’s protection key controls only data accesses to the address. It does not in any way affect instruction fetches from the address.
So it's possible to tag the code pages with a protection key whose access-disable bit is set in your application's PKRU register.
You would still be allowed to execute the code but not to read it.
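To make the PKRU idea a little more concrete, here is a small sketch of the register's bit layout as I understand it: two bits per protection key, with bit 2*i acting as access-disable (AD) and bit 2*i+1 as write-disable (WD) for key i. The helper names are hypothetical, and the code only does the bit arithmetic for user-mode data accesses; it does not touch the real register.

```c
#include <stdint.h>

/* Assumed PKRU layout: for key i (0..15), bit 2*i is access-disable (AD)
 * and bit 2*i+1 is write-disable (WD). Helper names are hypothetical. */

static inline uint32_t demo_pkru_access_disable(uint32_t pkru, unsigned key)
{
    return pkru | (1u << (2 * key));
}

static inline uint32_t demo_pkru_write_disable(uint32_t pkru, unsigned key)
{
    return pkru | (1u << (2 * key + 1));
}

/* User-mode data reads are allowed only while AD is clear. */
static inline int demo_pkru_read_allowed(uint32_t pkru, unsigned key)
{
    return (pkru & (1u << (2 * key))) == 0;
}

/* User-mode data writes need both AD and WD clear. */
static inline int demo_pkru_write_allowed(uint32_t pkru, unsigned key)
{
    return (pkru & (3u << (2 * key))) == 0;
}
```

Instruction fetches are deliberately absent from these helpers, since per the quoted text the key does not affect them.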
Desync the TLBs
If your application has never touched the code pages for reading, they will occupy some entries in the ITLB but not in the DTLB.
If the OS then maps them as supervisor-only without flushing the TLBs, access to them is prevented when they are accessed as data (since no DTLB entries for those pages are present, forcing a page walk), but thanks to the ITLB the code can still be fetched.
This is more involved in practice, as code spans multiple pages and is actually read as data by the OS.
EPT
The Extended Page Tables (EPT) are used during virtualization to translate guest physical addresses to host physical addresses.
Though they seem like just another level of indirection, they have separate read, write and execute control bits.
A paper has been written about preventing the leakage of the kernel code (to counteract dynamic Return Oriented Programming).

Finding physical memory offset to video buffer

I am using Fedora 17 running on board with integrated graphics.
Given that I am able to manipulate content of physical memory, how can I find out the physical memory offsets that I could write to in order to display something on the screen?
I tried to lookup 0xB8000 and 0xB0000 offsets but they contain all 0xff.
Is there a specific pattern that starts video buffer in memory?
Is there any good source of information about this topic?
The root cause of my problem was that Linux is not using legacy video mode, so the memory at 0xB8000 is restricted (in my case read only). However issuing an interrupt can switch to other modes:
INT 10 - VIDEO - SET VIDEO MODE
AH = 00h
AL = desired video mode (see #00010)
Found on: http://www.delorie.com/djgpp/doc/rbinter/id/74/0.html

Living like it's 1989
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/fb.h>

#define DEV_MEM "/dev/fb0"
/* Screen parameters (probably obtained via ioctl() and /sys). */
#define YRES 240
#define XRES 320
#define BYTES_PER_PIXEL (sizeof(unsigned short)) /* 16 bit pixels. */
#define MAP_SIZE (XRES * YRES * BYTES_PER_PIXEL)

int main(void)
{
    unsigned short *map_lbase;
    int fd;

    if ((fd = open(DEV_MEM, O_RDWR | O_SYNC)) == -1) {
        fprintf(stderr, "cannot open %s - are you root?\n", DEV_MEM);
        exit(1);
    }
    /* Map the framebuffer and let the kernel pick the address. */
    map_lbase = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map_lbase == MAP_FAILED) {
        perror("cannot mmap");
        exit(1);
    }
    /* ... read or write pixels through map_lbase here ... */
    munmap(map_lbase, MAP_SIZE);
    close(fd);
    return 0;
}
Humons - Framebuffer API doc, Framebuffer Doc dir.
Smart Humons - Internals, Deferred I/O doc or how to emulate memory mapped video.
You cannot use 0xB8000 and 0xB0000 directly, as those are physical addresses. I assume you are in user space and not writing a kernel driver. Under Linux, we normally have the MMU enabled; in other words, we have virtual memory. Not all processes/users have access to video memory. However, if you are allowed, you can mmap a framebuffer device into your address space. It is best to let the kernel give you an address and not request a specific one.
Or see how professionals do it.
Man: mmap
Edit: If you aren't root, you can still use Unix permissions on /dev/fb0 (or whichever device) to give a group permission to read/write, or use some sort of login process that gives a user on the current tty permission.
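The hard-coded XRES/YRES values in the snippet above could instead be queried from the driver. A minimal sketch, assuming the standard FBIOGET_VSCREENINFO and FBIOGET_FSCREENINFO ioctls from linux/fb.h; the helper name demo_fb_visible_bytes is made up for this example, and it returns 0 on any failure rather than reporting the cause.

```c
#include <fcntl.h>
#include <linux/fb.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: number of bytes covering the visible screen of the
 * framebuffer device at `path`, or 0 if it cannot be opened or queried. */
size_t demo_fb_visible_bytes(const char *path)
{
    struct fb_var_screeninfo var;   /* resolution, bits per pixel, ... */
    struct fb_fix_screeninfo fix;   /* line length, memory size, ... */
    int fd = open(path, O_RDWR);

    if (fd == -1)
        return 0;
    if (ioctl(fd, FBIOGET_VSCREENINFO, &var) == -1 ||
        ioctl(fd, FBIOGET_FSCREENINFO, &fix) == -1) {
        close(fd);
        return 0;
    }
    close(fd);
    /* line_length is the byte stride of one row, including any padding. */
    return (size_t)fix.line_length * var.yres;
}
```

The returned size would be a candidate for the length argument of mmap; var.xres and var.bits_per_pixel are the fields to consult when computing pixel offsets.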

Maybe you could start around here:
http://www.tldp.org/HOWTO/Framebuffer-HOWTO/
But modern video graphics are in no way as simple as "finding where in the memory the VRAM is" and writing there.

About BOOTCMD in Uboot

My board is an S3C6410. While reading the U-Boot source code, I found something that puzzles me.
#define CONFIG_BOOTCOMMAND "nand read 0xc0008000 0x100000 0x500000;bootm 0xc0008000"
What does it mean? Does it read 0x500000 bytes starting at NAND address 0x100000 into 0xc0008000 (SDRAM)?
But the start address of the SDRAM is 0x50000000, so how does the address 0xc0008000 make sense? Isn't it out of range?
Thanks.
My SDRAM size is 256M, and the board boots from NAND.
Here is some of the related configuration:
#define MEMORY_BASE_ADDRESS 0x50000000
#define CONFIG_NR_DRAM_BANKS 1 /* we have 2 bank of DRAM */
#define PHYS_SDRAM_1 MEMORY_BASE_ADDRESS /* SDRAM Bank #1 */
//#define PHYS_SDRAM_1_SIZE 0x08000000 /* 64 MB */
#define PHYS_SDRAM_1_SIZE 0x10000000
#define CFG_FLASH_BASE 0x00000000

It looks like you are reading it right. The address 0xc0008000 would be the destination of the read from NAND.
I'd suggest you interrupt the board's boot to get to the U-Boot command prompt. Then run printenv, which may show something in your target setup that overrides the source you have shown. Also try the command manually.

It means that 0x500000 bytes starting at NAND address 0x100000 should be written to memory at address 0xc0008000.
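The arithmetic behind the question can be checked with a short sketch. The constants are taken from the configuration quoted in the question; the helper names are hypothetical. Note the limitation: this only compares the load address against the physical SDRAM bank, and says nothing about any address translation the board's U-Boot build might set up.

```c
#include <stdint.h>

/* Values from the quoted board configuration. */
#define DEMO_SDRAM_BASE 0x50000000u  /* MEMORY_BASE_ADDRESS */
#define DEMO_SDRAM_SIZE 0x10000000u  /* PHYS_SDRAM_1_SIZE, 256 MiB */

/* First NAND address past the region read by "nand read <addr> <off> <size>". */
static inline uint32_t demo_nand_read_end(uint32_t off, uint32_t size)
{
    return off + size;
}

/* Does [addr, addr + len) lie entirely inside the physical SDRAM bank? */
static inline int demo_in_sdram_bank(uint32_t addr, uint32_t len)
{
    uint64_t end = (uint64_t)addr + len;
    uint64_t bank_end = (uint64_t)DEMO_SDRAM_BASE + DEMO_SDRAM_SIZE;

    return addr >= DEMO_SDRAM_BASE && end <= bank_end;
}
```

With the values from CONFIG_BOOTCOMMAND, the NAND region runs from 0x100000 up to 0x600000, and 0xc0008000 falls outside the 0x50000000 to 0x60000000 physical bank, which is exactly the mismatch the question points out.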

How does linux capability.h use 32-bit mask for 34 elements?

The file in /usr/include/linux/capability.h #defines 34 possible capabilities.
It goes like:
#define CAP_CHOWN 0
#define CAP_DAC_OVERRIDE 1
.....
#define CAP_MAC_ADMIN 33
#define CAP_LAST_CAP CAP_MAC_ADMIN
each process has capabilities defined thusly
typedef struct __user_cap_data_struct {
__u32 effective;
__u32 permitted;
__u32 inheritable;
} * cap_user_data_t;
I'm confused - a process can have 32-bits of effective capabilities, yet the total amount of capabilities defined in capability.h is 34. How is it possible to encode 34 positions in a 32-bit mask?

Because you haven't read all of the manual.
The capget manual starts by convincing you not to use it:
These two functions are the raw kernel interface for getting and setting thread capabilities. Not only are these system calls specific to Linux, but the kernel API is likely to change and use of these functions (in particular the format of the cap_user_*_t types) is subject to extension with each kernel revision, but old programs will keep working.
The portable interfaces are cap_set_proc(3) and cap_get_proc(3); if possible you should use those interfaces in applications. If you wish to use the Linux extensions in applications, you should use the easier-to-use interfaces capsetp(3) and capgetp(3).
Current details
Now that you have been warned, some current kernel details. The structures are defined as follows.
#define _LINUX_CAPABILITY_VERSION_1 0x19980330
#define _LINUX_CAPABILITY_U32S_1 1
#define _LINUX_CAPABILITY_VERSION_2 0x20071026
#define _LINUX_CAPABILITY_U32S_2 2
[...]
effective, permitted, inheritable are bitmasks of the capabilities
defined in capability(7). Note the CAP_* values are bit indexes and
need to be bit-shifted before ORing into the bit fields.
[...]
Kernels prior to 2.6.25 prefer 32-bit capabilities with version _LINUX_CAPABILITY_VERSION_1, and kernels 2.6.25+ prefer 64-bit capabilities with version _LINUX_CAPABILITY_VERSION_2. Note, 64-bit capabilities use datap[0] and datap[1], whereas 32-bit capabilities only use datap[0].
where datap is defined earlier as a pointer to a __user_cap_data_struct. So each 64-bit set is simply represented by two __u32 words, one in each element of an array of two __user_cap_data_struct.
This alone tells me never to use this API, so I didn't read the rest of the manual.
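To illustrate how a capability number above 31 lands in the second array element, here is a minimal sketch. The struct mirrors the three __u32 fields quoted in the question, but all names prefixed with demo_ are hypothetical and this is not the kernel's own code: the capability number is split into a word index (the number divided by 32) and a bit mask within that word.

```c
#include <stdint.h>

/* Same three 32-bit fields as __user_cap_data_struct in the question. */
struct demo_cap_data {
    uint32_t effective;
    uint32_t permitted;
    uint32_t inheritable;
};

/* Which array element holds capability `cap` (0 for 0..31, 1 for 32..63). */
static inline unsigned demo_cap_word(unsigned cap)
{
    return cap >> 5;
}

/* The CAP_* value is a bit index, so shift it into a mask within the word. */
static inline uint32_t demo_cap_mask(unsigned cap)
{
    return (uint32_t)1 << (cap & 31);
}

/* Test one capability in the effective set of a two-element array. */
static inline int demo_cap_is_effective(const struct demo_cap_data datap[2],
                                        unsigned cap)
{
    return (datap[demo_cap_word(cap)].effective & demo_cap_mask(cap)) != 0;
}
```

For CAP_MAC_ADMIN (33) this gives word 1 and mask 0x2, so the bit lives in datap[1] rather than datap[0].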

The CAP_* values themselves aren't bit masks, they're just constants (bit indexes). E.g. the value of CAP_MAC_ADMIN has more than one bit set: in binary, 33 is 100001.

Question about file seeking position

My previous question was about reading and writing raw data, but a new problem has arisen; it seems there is no end to them.
The question is: the offset parameters of functions like lseek() or fseek() are all 4 bytes wide, so moving across a span of more than 4G is impossible. I know that Win32 has a function (SetFilePointer) that takes the high and low halves of the offset separately and so supports 64-bit offsets, which is what I want.
But if I want to create an app on Linux or Unix (to create a file or write the raw drive sectors directly), how can I move to an offset beyond 4G?
Thanks, waiting for your replies.

The offset parameter of lseek is of type off_t. In 32-bit compilation environments, this type defaults to a 32-bit signed integer - however, if you compile with this macro defined before all system includes:
#define _FILE_OFFSET_BITS 64
...then off_t will be a 64-bit signed type.
For fseek, the fseeko function is identical except that it uses the off_t type for the offset, which allows the above solution to work with it too.
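A minimal sketch of the approach described above: define _FILE_OFFSET_BITS before any system include, then seek with an ordinary lseek. The helper name and the use of a temporary file are choices made for this example only; seeking past the end of a regular file does not allocate space, so nothing large is written.

```c
#define _FILE_OFFSET_BITS 64  /* must come before every system include */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: seek 5 GiB into a temporary file.
 * Returns 0 if lseek reports the requested offset, -1 otherwise. */
int demo_seek_past_4g(void)
{
    const off_t target = (off_t)5 << 30;  /* 5 GiB, does not fit in 32 bits */
    FILE *tmp = tmpfile();
    off_t pos;

    if (tmp == NULL)
        return -1;
    pos = lseek(fileno(tmp), target, SEEK_SET);
    fclose(tmp);  /* the temporary file is removed automatically */
    return pos == target ? 0 : -1;
}
```

The same pattern should carry over to fseeko on a FILE *, since it takes the same off_t type.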

A 4-byte unsigned integer can represent values up to 4294967295, which means that if you want to move more than 4G, you need to use lseek64(). In addition, you can use fgetpos() and fsetpos() to change the position in the file.

On Windows, use _lseeki64(), on Linux, lseek64().
I recommend using lseek64() on both systems by doing something like this:
#ifdef _WIN32
#include <io.h>
#define lseek64 _lseeki64
#else
#include <unistd.h>
#endif
That's all you need.
