Any good guides on using PTRACE_SYSEMU? - linux

Does anyone have any good explanations, tutorials, books, or guides on the use of PTRACE_SYSEMU?

What I found interesting:
Example Implementation for ptrace
Playing with ptrace, Part I - LinuxJournal.com
Playing with ptrace, Part II - LinuxJournal.com
There is also a library that makes ptrace easier to use:
PinkTrace - ptrace() wrapper library.
PinkTrace ships with examples, and the sydbox sources show a complex PinkTrace use case. In my experience, the author is also a good person to contact about using and testing PinkTrace.

There is a small test in the Linux kernel sources that uses PTRACE_SYSEMU:
http://code.metager.de/source/xref/linux/stable/tools/testing/selftests/x86/ptrace_syscall.c
or http://lxr.free-electrons.com/source/tools/testing/selftests/x86/ptrace_syscall.c
	struct user_regs_struct regs;

	printf("[RUN]\tSYSEMU\n");
	if (ptrace(PTRACE_SYSEMU, chld, 0, 0) != 0)
		err(1, "PTRACE_SYSCALL");
	wait_trap(chld);

	if (ptrace(PTRACE_GETREGS, chld, 0, &regs) != 0)
		err(1, "PTRACE_GETREGS");

	if (regs.user_syscall_nr != SYS_gettid ||
	    regs.user_arg0 != 10 || regs.user_arg1 != 11 ||
	    regs.user_arg2 != 12 || regs.user_arg3 != 13 ||
	    regs.user_arg4 != 14 || regs.user_arg5 != 15) {
		printf("[FAIL]\tInitial args are wrong (nr=%lu, args=%lu %lu %lu %lu %lu %lu)\n", (unsigned long)regs.user_syscall_nr, (unsigned long)regs.user_arg0, (unsigned long)regs.user_arg1, (unsigned long)regs.user_arg2, (unsigned long)regs.user_arg3, (unsigned long)regs.user_arg4, (unsigned long)regs.user_arg5);
		nerrs++;
	} else {
		printf("[OK]\tInitial nr and args are correct\n");
	}

	printf("[RUN]\tRestart the syscall (ip = 0x%lx)\n",
	       (unsigned long)regs.user_ip);

	/*
	 * This does exactly what it appears to do if syscall is int80 or
	 * SYSCALL64. For SYSCALL32 or SYSENTER, though, this is highly
	 * magical. It needs to work so that ptrace and syscall restart
	 * work as expected.
	 */
	regs.user_ax = regs.user_syscall_nr;
	regs.user_ip -= 2;
	if (ptrace(PTRACE_SETREGS, chld, 0, &regs) != 0)
		err(1, "PTRACE_SETREGS");

	if (ptrace(PTRACE_SYSEMU, chld, 0, 0) != 0)
		err(1, "PTRACE_SYSCALL");
	wait_trap(chld);

	if (ptrace(PTRACE_GETREGS, chld, 0, &regs) != 0)
		err(1, "PTRACE_GETREGS");

So it looks like just another ptrace request: it lets the program run until it makes its next system call, then stops the child and signals the tracer. The tracer can then read the registers, optionally change some of them, and restart the syscall.
It is implemented in http://lxr.free-electrons.com/source/kernel/ptrace.c?v=4.10#L1039 like the other stepping ptrace requests:
#ifdef PTRACE_SINGLESTEP
	case PTRACE_SINGLESTEP:
#endif
#ifdef PTRACE_SINGLEBLOCK
	case PTRACE_SINGLEBLOCK:
#endif
#ifdef PTRACE_SYSEMU
	case PTRACE_SYSEMU:
	case PTRACE_SYSEMU_SINGLESTEP:
#endif
	case PTRACE_SYSCALL:
	case PTRACE_CONT:
		return ptrace_resume(child, request, data);
The man page also has some information: http://man7.org/linux/man-pages/man2/ptrace.2.html
PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP (since Linux 2.6.14)
For PTRACE_SYSEMU, continue and stop on entry to the next
system call, which will not be executed. See the
documentation on syscall-stops below. For
PTRACE_SYSEMU_SINGLESTEP, do the same but also singlestep if
not a system call. This call is used by programs like User
Mode Linux that want to emulate all the tracee's system calls.
The data argument is treated as for PTRACE_CONT. The addr
argument is ignored. These requests are currently supported
only on x86.
So, according to the man page, it is not portable: it is supported only on x86 and is intended for User Mode Linux (um) as a variant of the classic PTRACE_SYSCALL. The um check for sysemu, with some comments, is here: http://lxr.free-electrons.com/source/arch/um/os-Linux/start_up.c?v=4.10#L155
__uml_setup("nosysemu", nosysemu_cmd_param,
"nosysemu\n"
" Turns off syscall emulation patch for ptrace (SYSEMU) on.\n"
" SYSEMU is a performance-patch introduced by Laurent Vivier. It changes\n"
" behaviour of ptrace() and helps reducing host context switch rate.\n"
" To make it working, you need a kernel patch for your host, too.\n"
" See http://perso.wanadoo.fr/laurent.vivier/UML/ for further \n"
" information.\n\n");

static void __init check_sysemu(void)
The link in that comment used to redirect to a little-known site from 2004, http://sysemu.sourceforge.net/, which explains:
Why ?
UML uses ptrace() and PTRACE_SYSCALL to catch system calls. But, by
this way, you can't remove the real system call, only monitor it. UML,
to avoid the real syscall and emulate it, replaces the real syscall by
a call to getpid(). This method generates two context-switches instead
of one.
A Solution
A solution is to change the behaviour of ptrace() to not call the real
syscall and thus we don't have to replace it by a call to getpid().
How ?
By adding a new command to ptrace(), PTRACE_SYSEMU, that acts like
PTRACE_SYSCALL without executing syscall. To add this command we need
to patch the host kernel. To use this new command in UML kernel, we
need to patch the UML kernel too.
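
The "PTRACE_SYSCALL without executing the syscall" behaviour described above can be sketched roughly as follows. This is only a minimal illustration, not how UML does it: it assumes x86-64 Linux, the helper name sysemu_fake_getpid is made up for this example, and on any other platform it simply returns -1. The tracer skips the child's getpid(), writes a fake return value into rax, and then reads the exit code the child tries to pass to exit_group (which is never executed either).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from any library).
 * Returns the exit code the child tried to exit with: 0 if the child saw
 * fake_pid as the result of getpid(), 1 if it saw anything else.
 * Returns -1 if ptrace is unavailable or the platform is not x86-64 Linux.
 */
int sysemu_fake_getpid(long fake_pid)
{
#if defined(__linux__) && defined(__x86_64__)
    int status;
    int exit_code = -1;

    assert(fake_pid > 0);

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, 0, 0) != 0)
            _exit(2);
        kill(getpid(), SIGSTOP);          /* hand control to the tracer */
        long seen = syscall(SYS_getpid);  /* intercepted, never executed */
        _exit(seen == fake_pid ? 0 : 1);  /* intercepted as well */
    }

    /* Wait for the SIGSTOP; if the child is already gone, give up. */
    if (waitpid(child, &status, 0) != child || !WIFSTOPPED(status))
        return -1;

    for (;;) {
        struct user_regs_struct regs;

        /* Run until the next syscall entry; the syscall is skipped. */
        if (ptrace(PTRACE_SYSEMU, child, 0, 0) != 0)
            break;
        if (waitpid(child, &status, 0) != child || !WIFSTOPPED(status))
            return -1;
        if (WSTOPSIG(status) != SIGTRAP)
            break;
        if (ptrace(PTRACE_GETREGS, child, 0, &regs) != 0)
            break;

        if (regs.orig_rax == SYS_exit_group || regs.orig_rax == SYS_exit) {
            exit_code = (int)regs.rdi;    /* first argument of exit */
            break;
        }
        if (regs.orig_rax == SYS_getpid)
            regs.rax = fake_pid;          /* emulated return value */
        if (ptrace(PTRACE_SETREGS, child, 0, &regs) != 0)
            break;
    }

    /* The child's exit was skipped too, so terminate it explicitly. */
    kill(child, SIGKILL);
    waitpid(child, &status, 0);
    return exit_code;
#else
    (void)fake_pid;
    return -1;
#endif
}
```

Calling sysemu_fake_getpid(4242) should return 0 when the child saw 4242 from its getpid call, 1 if it saw something else, and -1 if ptrace is not usable (for example, inside a restricted sandbox).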

Related

Permission denied while accessing /proc/<pid>/exe

I am having trouble accessing a file in the /proc filesystem.
Once started, my process writes to a log file. The process stopped, and when I checked the log file to see where it hit the problem, I found "Permission denied".
The process goes to the /proc directory, gets its PID via getPID(), and calls open() with O_RDONLY to read /proc/<pid>/exe,
but the open() fails with "Permission denied".
I did some research and found that the kernel enforces restrictions on access to certain files in /proc. However, I have 20 processes all accessing the same /proc/<pid>/exe, and only one of them has this problem.
CHAR fn[100];
CHAR args[500];
CHAR ProgName[50];
CHAR *arr[6];
CHAR *buf;
CHAR ProcessId[10];
static int count_try = 0;


memset(fn,0,100);
memset(ProcessId,0,10);
sprintf (ProcessId,"%d",Pid);
strcpy(fn, "/proc/");
strcat(fn, ProcessId);
//strcat(fn, "/elf_prpsinfo");
strcat(fn, "/exe");

if ((psp = open(fn, O_RDONLY)) == -1)
{
    perror("GetProgName:ps open::");
    exit(ERROR);
}
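
For comparison, here is a minimal sketch of a bounded version of the same path construction. It is not the asker's code: the helper name get_prog_path is hypothetical, and it swaps open() for readlink(), since /proc/<pid>/exe is a symbolic link to the executable. As far as I know, access to that link is subject to a ptrace-style permission check against the target process, so a different owner or a non-dumpable process is one plausible source of EACCES; the sketch reports errno so that this case can be told apart from others.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: resolves /proc/<pid>/exe into buf.
 * Returns the length of the path, or -1 with errno set
 * (EACCES is the "Permission denied" case from the question).
 */
ssize_t get_prog_path(pid_t pid, char *buf, size_t buflen)
{
    char link[64];

    assert(buf != NULL && buflen > 1);

    /* snprintf never writes past the end of link, unlike sprintf/strcat. */
    int n = snprintf(link, sizeof link, "/proc/%ld/exe", (long)pid);
    if (n < 0 || (size_t)n >= sizeof link) {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* readlink does not NUL-terminate, so leave room for the terminator. */
    ssize_t len = readlink(link, buf, buflen - 1);
    if (len < 0) {
        int saved = errno;
        fprintf(stderr, "readlink(%s): %s\n", link, strerror(saved));
        errno = saved;
        return -1;
    }
    buf[len] = '\0';
    return len;
}
```

Logging the PID together with errno from each of the 20 processes would show whether the failing one is looking at a different target process than the others.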

A thread who is spinning and trying to get the spinlock can't be preempted?

When a thread on Linux is spinning while trying to acquire a spinlock, is there no chance that it can be preempted?
EDIT:
I just want to confirm something. Suppose a uniprocessor ("UP") system on which no interrupt handler accesses this spinlock. If a thread that is spinning on the spinlock can be preempted, then I think the critical section protected by the spinlock could call sleep, since the thread holding the spinlock can be scheduled back onto the CPU.
No, it cannot be preempted. See the code (taken from the Linux sources): http://lxr.free-electrons.com/source/include/linux/spinlock_api_smp.h?v=2.6.32#L241
241 static inline unsigned long __spin_lock_irqsave(spinlock_t *lock)
242 {
243 unsigned long flags;
244
245 local_irq_save(flags);
246 preempt_disable();
247 spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
248 /*
249 * On lockdep we dont want the hand-coded irq-enable of
250 * _raw_spin_lock_flags() code, because lockdep assumes
251 * that interrupts are not re-enabled during lock-acquire:
252 */
253 #ifdef CONFIG_LOCKDEP
254 LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
255 #else
256 _raw_spin_lock_flags(lock, &flags);
257 #endif
258 return flags;
259 }
260
[...]
349 static inline void __spin_unlock(spinlock_t *lock)
350 {
351 spin_release(&lock->dep_map, 1, _RET_IP_);
352 _raw_spin_unlock(lock);
353 preempt_enable();
354 }
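See lines 246 and 353: preemption is disabled before the lock is acquired and is re-enabled only after the lock has been released.
By the way, it is generally a bad idea to sleep while holding a lock (spinlock or not).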
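
The ordering in the listing (disable preemption, spin, and re-enable preemption only after the release) can be mimicked with a small userspace model. This is purely illustrative: nothing here affects the real scheduler, and every name prefixed with model_ is made up for this sketch, with a plain counter standing in for the kernel's preempt count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the kernel's preempt count (hypothetical, userspace only). */
int model_preempt_count;

typedef struct {
    atomic_flag locked;
} model_spinlock_t;

/* "Preemption" is allowed only while the count is zero. */
bool model_preemptible(void)
{
    return model_preempt_count == 0;
}

void model_spin_lock(model_spinlock_t *lock)
{
    /* Mirrors preempt_disable() at line 246: done before spinning. */
    model_preempt_count++;

    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&lock->locked,
                                             memory_order_acquire))
        ;
}

void model_spin_unlock(model_spinlock_t *lock)
{
    assert(model_preempt_count > 0);

    atomic_flag_clear_explicit(&lock->locked, memory_order_release);

    /* Mirrors preempt_enable() at line 353: done after the release. */
    model_preempt_count--;
}
```

In this model, model_preemptible() is false from the moment model_spin_lock() starts spinning until model_spin_unlock() returns, which is the window the answer refers to.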

Using DMA API in linux kernel but channel is never available

I am trying to use dmatest.c to test DMA on an Intel Xeon server and on a regular laptop with an i7 processor. It is never able to get a channel; I found this out by debugging dmatest.c itself. Line 854 below is always executed (I put my own printk there).
Is there anything I should do before running the test to get this API to work (such as loading DMA modules)?
Or am I using the wrong API set?
For the Xeon server, I did some research and found that it has an ioatdma.ko module that can be loaded:
modprobe ioatdma
After that, some files are available under /sys/class/dma, such as dma0chan0, dma1chan0, and so on.
However, when I run the dmatest code, it still can't get any channel.
Any help or hints are appreciated.
836 static void request_channels(struct dmatest_info *info,
837 enum dma_transaction_type type)
838 {
839 dma_cap_mask_t mask;
840
841 dma_cap_zero(mask);
842 dma_cap_set(type, mask);
843 for (;;) {
844 struct dmatest_params *params = &info->params;
845 struct dma_chan *chan;
846
847 chan = dma_request_channel(mask, filter, params);
848 if (chan) {
849 if (dmatest_add_channel(info, chan)) {
850 dma_release_channel(chan);
851 break; /* add_channel failed, punt */
852 }
853 } else
854 break; /* no more channels available */
The test commands I used (following the dmatest.txt document in the kernel documentation):
% echo dma0chan0 > /sys/kernel/debug/dmatest/channel
% echo 2000 > /sys/kernel/debug/dmatest/timeout
% echo 1 > /sys/kernel/debug/dmatest/iterations
% echo 1 > /sys/kernel/debug/dmatest/run

How does the Linux kernel get info about the processors and the cores?

Assume we have a blank computer without any OS and we are installing Linux. Where in the kernel is the code that identifies the processors and cores and gets information about and from them?
This info eventually shows up in places like /proc/cpuinfo but how does the kernel get it in the first place?!
Short answer
The kernel uses the special CPU instruction cpuid and saves the results in an internal structure, which is cpuinfo_x86 on x86.
Long answer
The kernel source is your best friend.
Start from the entry point: the file /proc/cpuinfo.
Like any proc file, it has to be created somewhere in the kernel and declared with some file_operations. This is done in fs/proc/cpuinfo.c. The interesting piece is seq_open, which takes a reference to cpuinfo_op. These ops are declared in arch/x86/kernel/cpu/proc.c, where we find the show_cpuinfo function. This function is in the same file on line 57.
There you can see:
	seq_printf(m, "processor\t: %u\n"
		   "vendor_id\t: %s\n"
		   "cpu family\t: %d\n"
		   "model\t\t: %u\n"
		   "model name\t: %s\n",
		   cpu,
		   c->x86_vendor_id[0] ? c->x86_vendor_id : "unknown",
		   c->x86,
		   c->x86_model,
		   c->x86_model_id[0] ? c->x86_model_id : "unknown");
The variable c is declared on the first line of the function as a struct cpuinfo_x86. This structure is defined in arch/x86/include/asm/processor.h. If you search for references to that structure, you will find the function cpu_detect, which calls cpuid, which finally resolves to native_cpuid:
static inline void native_cpuid(unsigned int *eax, unsigned int *ebx,
				unsigned int *ecx, unsigned int *edx)
{
	/* ecx is often an input as well as an output. */
	asm volatile("cpuid"
	    : "=a" (*eax),
	      "=b" (*ebx),
	      "=c" (*ecx),
	      "=d" (*edx)
	    : "0" (*eax), "2" (*ecx)
	    : "memory");
}
Here you see the assembler instruction cpuid. This little instruction does the real work.
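
The same instruction can be tried from userspace. The following is a minimal sketch, assuming GCC or Clang on x86 (both ship a <cpuid.h> header that provides __get_cpuid); the helper name read_cpu_vendor is made up for this example, and on non-x86 builds it just returns -1. It reads CPUID leaf 0 and assembles the value that /proc/cpuinfo shows as vendor_id.

```c
#include <assert.h>
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/*
 * Hypothetical helper: copies the CPU vendor string into out, which must
 * hold at least 13 bytes. Returns 0 on success, or -1 if CPUID leaf 0 is
 * unavailable or this is not an x86 build.
 */
int read_cpu_vendor(char out[13])
{
    assert(out != NULL);
    out[0] = '\0';

#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 0 returns the vendor string in EBX, EDX, ECX (in that order). */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return -1;

    memcpy(out + 0, &ebx, 4);
    memcpy(out + 4, &edx, 4);
    memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
    return 0;
#else
    return -1;
#endif
}
```

On a typical x86 machine the resulting string should match the vendor_id line in /proc/cpuinfo.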
This information comes from the BIOS plus a hardware database. You can get it directly with dmidecode, for example (if you need more, check the dmidecode source code):
sudo dmidecode -t processor

Why ISA doesn't need request_mem_region

I'm reading the source code for LDD3 Chapter 9, which has an example ISA driver named silly.
The following is the module's initialization. What I don't understand is why there is no call to request_mem_region() before ioremap() is invoked on line 282.
268 int silly_init(void)
269 {
270 int result = register_chrdev(silly_major, "silly", &silly_fops);
271 if (result < 0) {
272 printk(KERN_INFO "silly: can't get major number\n");
273 return result;
274 }
275 if (silly_major == 0)
276 silly_major = result; /* dynamic */
277 /*
278 * Set up our I/O range.
279 */
280
281 /* this line appears in silly_init */
282 io_base = ioremap(ISA_BASE, ISA_MAX - ISA_BASE);
283 return 0;
284 }
This particular driver allows accesses to all the memory in the range 0xA0000..0x100000.
If there actually are any devices in this range, then it is likely that some other driver has already reserved some of that memory. So if silly were to try to call request_mem_region, the call would fail, or it would be necessary to unload that other driver before loading silly.
On a PC, this range contains memory of the graphics card, and the system BIOS:
$ cat /proc/iomem
...
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cedff : Video ROM
000d0000-000dffff : PCI Bus 0000:00
000e4000-000fffff : reserved
000f0000-000fffff : System ROM
...
Unloading the graphics driver often is not possible (because it's not a module), and would prevent you from seeing what the silly driver does, and the ROM memory ranges are reserved by the kernel itself and cannot be freed.
TL;DR: Not calling request_mem_region is a particular quirk of the silly driver.
Any 'real' driver would be required to call it.
