Possible to insert an existing page into another VMA structure? - linux

I've been learning about the Linux kernel through some experiments. Recently I've been wondering whether it is possible to share pages between two user-space processes by inserting the pages of one process into the VMA structure of the other, after the latter calls mmap and sends the address back to the kernel through netlink. The insertion would be done in a driver module. The reason for this test is that the two processes might not be directly communicating with each other, and duplicating pages of read-only memory could be a bad choice in terms of efficiency and redundancy.
After some research I found the vm_insert_page function and the traditional remap_pfn_range. However, the LXR says:
/**
 * vm_insert_page - insert single page into user vma
 * @vma: user vma to map to
 * @addr: target user address of this page
 * @page: source kernel page
 *
 * This allows drivers to insert individual pages they've allocated
 * into a user vma.
 *
 * The page has to be a nice clean individual kernel allocation.
 */
Does this mean it's impossible to insert an existing page into another VMA? Can the function only be called with freshly allocated pages? I always thought pages could be shared through their reference count.
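For context, here is a minimal sketch of how a driver typically calls vm_insert_page() from its mmap file operation. The names demo_mmap and shared_page are hypothetical, and the page is assumed to have been allocated earlier (e.g. with alloc_page()); whether this is a sound way to share an existing process page is exactly the question above.

#include <linux/fs.h>
#include <linux/mm.h>

static struct page *shared_page;   /* assumed allocated elsewhere */

static int demo_mmap(struct file *filp, struct vm_area_struct *vma)
{
        if (vma->vm_end - vma->vm_start < PAGE_SIZE)
                return -EINVAL;

        /* vm_insert_page() takes its own reference on the page, so the
         * page's lifetime is tracked through its reference count. */
        return vm_insert_page(vma, vma->vm_start, shared_page);
}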

Related

Trace page table access of a Linux process

I am writing to inquire about the feasibility of tracing the page table accesses (in terms of an "index" for each page accessed) of a common Linux user application. Basically, I am trying to reproduce the exploit described in this research article (https://www.ieee-security.org/TC/SP2015/papers-archived/6949a640.pdf). In particular, the data-page accesses need to be recorded for use in inferring program secrets.
I understand that on a Linux system with a 64-bit x86 architecture, the page size is 4 KB. I have used Pin (https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool) to log a trace of addresses for all virtual memory accesses. So can I simply calculate the "index" of each data page access with the following translation rule?
index = address >> 15
Since 4 KB = 2^15. Is it correct? Thank you in advance for any suggestions or comments.
Also, one thing I want to point out: conceptually, I don't need a "precise" identifier for each data page, just a number (an "index") that distinguishes accesses to different data pages. This should provide conceptually the same amount of information as their attack uses.
Ok, so you don't really need an "index", just some unique identifier to distinguish different pages in the virtual address space of a process.
In that case, you can just do address >> PAGE_SHIFT. Note that 4 KB is 2^12, not 2^15: on x86 with 4 KB pages, PAGE_SHIFT is 12, so you can do:
page_id = address >> 12
Then if address1 and address2 correspond to the same page the page_id will be the same for both addresses.
Alternatively, to achieve an equivalent result, you could do address & PAGE_MASK, where PAGE_MASK is just 0xfffffffffffff000 (that is, ~((1UL << PAGE_SHIFT) - 1)). This yields the page's base address rather than its index, but it distinguishes pages just as well.
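To make the arithmetic concrete, here is a small self-contained C sketch (the two sample addresses are made up for illustration):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12   /* 4 KB pages on x86 */
#define PAGE_MASK  (~((1UL << PAGE_SHIFT) - 1))

int main(void)
{
        uint64_t address1 = 0x7f3a12345678ULL;
        uint64_t address2 = 0x7f3a12345abcULL;   /* same 4 KB page */

        uint64_t id1 = address1 >> PAGE_SHIFT;   /* page identifier */
        uint64_t id2 = address2 >> PAGE_SHIFT;

        /* Equivalent distinguishing value: the page base address. */
        uint64_t base1 = address1 & PAGE_MASK;

        printf("id1=%" PRIx64 " id2=%" PRIx64 " same_page=%d base1=%" PRIx64 "\n",
               id1, id2, id1 == id2, base1);
        return 0;
}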

how to determine page(cache) boundaries when writing to a file

In Linux, when writing to a file, the kernel maintains multiple in-memory pages (4 KB in size). Data is first written to these pages, and a background kernel process (historically bdflush, nowadays the flusher threads) writes the data to the disk drive.
Is there a way to determine page boundaries when writing sequentially to a file?
Can I assume it is always bytes 1-4096: page 1 and bytes 4097-8192: page 2,
or can it vary?
Say I start writing from position 10 (i.e. the first 10 bytes were already written to the file previously, and I set the file position to 10 before starting to write). Will the page boundary still be
1-4096 : page 1
OR
10-4105 : page 1 ?
The reason for asking:
I can use sync_file_range
http://man7.org/linux/man-pages/man2/sync_file_range.2.html
to flush data from kernel pages to the disk drive in an orderly manner.
If I can determine page boundaries, I can call sync_file_range only when a page boundary is reached, so that unnecessary sync_file_range calls are avoided.
Edit:
The only hint of such boundary alignment I could find was in the mmap man page, which asks for the offset to be a multiple of the page size:
offset must be a multiple of the page size as
returned by sysconf(_SC_PAGE_SIZE)
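For what it's worth, pages in the page cache are indexed by file offset, so their boundaries fall at multiples of the page size regardless of where you started writing (i.e. your first layout, not the second). A hedged sketch of exploiting that with sync_file_range follows; write_and_flush is a hypothetical helper, it assumes strictly sequential writes, and error handling is abbreviated:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static void write_and_flush(int fd, const char *buf, size_t len, off_t *off)
{
        long page = sysconf(_SC_PAGE_SIZE);   /* typically 4096 */
        off_t start = *off;

        (void) pwrite(fd, buf, len, start);
        *off = start + len;

        /* Page-cache pages cover file offsets [n*page, (n+1)*page).
         * If this write crossed a boundary, sync the completed pages. */
        off_t first_page = start / page;
        off_t last_page  = (*off - 1) / page;
        if (last_page > first_page)
                sync_file_range(fd, first_page * page,
                                (last_page - first_page) * page,
                                SYNC_FILE_RANGE_WRITE);
}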

Buffer overflow exploitation 101

I've heard in a lot of places that buffer overflows and illegal indexing in C-like languages may compromise the security of a system. But in my experience, all they do is crash the program I'm running. Can anyone explain how buffer overflows could cause security problems? An example would be nice.
I'm looking for a conceptual explanation of how something like this could work. I don't have any experience with ethical hacking.
First, buffer overflows (BOF) are only one method of gaining code execution. When one occurs, the attacker can essentially gain control of the process. This means the attacker can make the process execute arbitrary code with the current process's privileges (whether the process runs as a high- or low-privileged user respectively increases or reduces the impact of exploiting a BOF on that application). This is why it is always strongly recommended to run applications with the least privileges needed.
Basically, to understand how a BOF works, you have to understand how the code you have written gets compiled into machine code (ASM) and how the data managed by your software is stored in memory.
I will try to give you a basic example of a subcategory of BOF called stack-based buffer overflows:
Imagine you have an application asking the user to provide a username.
This data will be read from user input and then stored in a variable called USERNAME, allocated as a 20-byte array of chars.
For this scenario to work, we will assume the program does not check the length of the user input.
At some point during data processing, the user input is copied into the USERNAME variable (20 bytes), but since the user input is longer (let's say 500 bytes), the data around this variable gets overwritten in memory.
Imagine a memory layout like this:
size in bytes 20 4 4 4
data [USERNAME][variable2][variable3][RETURN ADDRESS]
If you define the 3 local variables USERNAME, variable2 and variable3, they may be stored in memory the way shown above.
Notice the RETURN ADDRESS: this 4-byte memory region stores the address of the instruction to return to in the calling function (thanks to this, when a called function reaches its end, the program flow naturally goes back to the instruction just after the initial call to that function).
If an attacker provides a username of 24 'A' characters, the memory layout would become something like this:
size in bytes 20 4 4 4
data [USERNAME][variable2][variable3][RETURN ADDRESS]
new data [AAA...AA][ AAAA ][variable3][RETURN ADDRESS]
Now, if an attacker sends 50 'A' characters as the USERNAME, the memory layout would look like this:
size in bytes 20 4 4 4
data [USERNAME][variable2][variable3][RETURN ADDRESS]
new data [AAA...AA][ AAAA ][ AAAA ][ AAAA ][OTHER AAA...]
In this situation, at the end of the function's execution, the program would crash: it will try to jump to the invalid address 0x41414141 (the char 'A' is 0x41), because the overwritten RETURN ADDRESS no longer points to valid code.
If you replace the multiple 'A's with carefully chosen bytes, you may be able to:
overwrite RETURN ADDRESS with an interesting location.
place "executable code" in the first 20 + 4 + 4 bytes.
You could for instance set RETURN ADDRESS to the address of the first byte of the USERNAME variable (this method is mostly not usable anymore thanks to the many protections that have been added to both operating systems and compiled programs, such as non-executable stacks and ASLR).
I know it is quite complex to understand at first, and this explanation is a very basic one. If you want more detail, please just ask.
I also suggest you have a look at great tutorials like this one, which are quite advanced but more realistic.
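To make this concrete, here is a minimal vulnerable program in the spirit of the example above. It is illustrative only: the function and variable names are made up, and modern protections (stack canaries, ASLR, non-executable stacks) will normally stop the classic form of this attack unless you build with something like gcc -fno-stack-protector.

#include <stdio.h>
#include <string.h>

/* greet() copies user input into a fixed 20-byte buffer with no
 * length check; input longer than the buffer overwrites the memory
 * above it on the stack, eventually the saved return address. */
static void greet(const char *input)
{
        char username[20];

        strcpy(username, input);   /* the bug: no bounds checking */
        printf("Hello, %s\n", username);
}   /* returning from greet() uses the (possibly overwritten) return address */

int main(int argc, char **argv)
{
        if (argc > 1)
                greet(argv[1]);
        return 0;
}

Passing a 50-character argument reproduces the 50-'A' crash described above.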

Reading /dev/urandom as early as possible

I am performing research in the field of random number generation and I need to demonstrate the "boot-time entropy hole" from the well-known "P's and Q's" paper (here). We will be spooling up two copies of the same minimal Linux virtual machine at the same time and we are expecting their /dev/urandom values to be the same at some early point in the boot process.
However, I have been unable to read /dev/urandom early enough in the boot process to spot the issue. We need to read it earlier in the boot process.
How can I get the earliest possible values of /dev/urandom? We likely will need to modify the kernel, but we have very little experience there, and need some pointers. Or, if there's a kernel-instrumenting tool available that could do it without re-compiling a kernel, that would be great, too.
Thanks in advance!
urandom is provided via a device driver, and the first thing the kernel does with the driver is call its init function.
If you take a look here: http://lxr.free-electrons.com/source/drivers/char/random.c#L1401
/*
 * Note that setup_arch() may call add_device_randomness()
 * long before we get here. This allows seeding of the pools
 * with some platform dependent data very early in the boot
 * process. But it limits our options here. We must use
 * statically allocated structures that already have all
 * initializations complete at compile time. We should also
 * take care not to overwrite the precious per platform data
 * we were given.
 */
static int rand_initialize(void)
{
        init_std_data(&input_pool);
        init_std_data(&blocking_pool);
        init_std_data(&nonblocking_pool);
        return 0;
}
early_initcall(rand_initialize);
So, the init function for this driver is rand_initialize. Note, however, that the comment says setup_arch() may call add_device_randomness() before this device is even initialized. Calling that function does not add any actual entropy, though (it feeds the pool with things like MAC addresses, so if you have two exactly identical VMs, you're good there). From the comment:
/*
 * add_device_randomness() is for adding data to the random pool that
 * is likely to differ between two devices (or possibly even per boot).
 * This would be things like MAC addresses or serial numbers, or the
 * read-out of the RTC. This does *not* add any actual entropy to the
 * pool, but it initializes the pool to different values for devices
 * that might otherwise be identical and have very little entropy
 * available to them (particularly common in the embedded world).
 */
Also, note that entropy pools are saved on shutdown and restored at boot time via an init script (on my Ubuntu 14.04 it's /etc/init.d/urandom), so you might want to call your script from that script before
(
        date +%s.%N

        # Load and then save $POOLBYTES bytes,
        # which is the size of the entropy pool
        if [ -f "$SAVEDFILE" ]
        then
                cat "$SAVEDFILE"
        fi
        # Redirect output of subshell (not individual commands)
        # to cope with a misfeature in the FreeBSD (not Linux)
        # /dev/random, where every superuser write/close causes
        # an explicit reseed of the yarrow.
) >/dev/urandom
or a similar call is made.
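If you do end up modifying the kernel, one possible approach (a sketch, not tested) is to register your own early initcall that dumps a few bytes from the same pool /dev/urandom draws from. The function name is hypothetical, the code must be built into the kernel image (loadable modules come up far too late), and initcall ordering within the same level depends on link order, so treat it as a starting point:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/random.h>

static int __init early_urandom_probe(void)
{
        unsigned char buf[16];

        get_random_bytes(buf, sizeof(buf));   /* same pool as /dev/urandom */
        print_hex_dump(KERN_INFO, "early-rand: ", DUMP_PREFIX_OFFSET,
                       16, 1, buf, sizeof(buf), false);
        return 0;
}
early_initcall(early_urandom_probe);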

How does Linux Process Accounting (psacct) work?

I can find a lot of documents about psacct, but they address usage, not how it works.
Question
I really want to know how process accounting works:
Which part of the system records information about processes?
How does it work?
Already done
I installed psacct on RHEL 6.5.
The service start script (/etc/init.d/psacct) actually calls this:
/sbin/accton $ACCTFILE
/sbin/accton in turn calls the acct() system call:
man acct
DESCRIPTION
The acct() system call enables or disables process accounting. If called with the name of an existing file as its argument, accounting is turned on, and records for each terminating process are appended to filename as it terminates. An argument of NULL causes accounting to be turned off.
The answer to your question is in the Linux source file kernel/acct.c, particularly in the fill_ac() function:
/*
* Write an accounting entry for an exiting process
*
* The acct_process() call is the workhorse of the process
* accounting system. The struct acct is built here and then written
* into the accounting file. This function should only be called from
* do_exit() or when switching to a different output file.
*/
static void fill_ac(acct_t *ac)
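For completeness, the userspace side is tiny. Here is a hedged sketch of roughly what /sbin/accton does (the accounting file path below is hypothetical; the file must already exist and the caller needs CAP_SYS_PACCT, typically root):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Turn accounting on: from now on the kernel appends one
         * struct acct record to this file for each exiting process. */
        if (acct("/var/account/pacct") == -1) {
                perror("acct");
                return 1;
        }

        /* acct(NULL) would turn accounting off again. */
        return 0;
}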
