I am trying to use dmatest.c to test DMA on an Intel Xeon server and on a regular laptop with an i7 processor. It has never been able to get a channel; I found this out by debugging dmatest.c itself. Line 854 below is always executed (I put my own printk there).
Is there anything I should do to get this API to work before running the test (such as loading DMA modules)?
Or am I using the wrong API set?
On the Xeon server, I did some research and found that it has an ioatdma.ko module that can be loaded:
modprobe ioatdma
After that, some entries become available under /sys/class/dma, such as dma0chan0, dma1chan0, etc.
However, when I run the dmatest code, it still can't get any channel.
Any help or hint is appreciated.
836 static void request_channels(struct dmatest_info *info,
837 				 enum dma_transaction_type type)
838 {
839 	dma_cap_mask_t mask;
840
841 	dma_cap_zero(mask);
842 	dma_cap_set(type, mask);
843 	for (;;) {
844 		struct dmatest_params *params = &info->params;
845 		struct dma_chan *chan;
846
847 		chan = dma_request_channel(mask, filter, params);
848 		if (chan) {
849 			if (dmatest_add_channel(info, chan)) {
850 				dma_release_channel(chan);
851 				break; /* add_channel failed, punt */
852 			}
853 		} else
854 			break; /* no more channels available */
The test commands that I used (following the dmatest.txt document in the kernel documentation):
% echo dma0chan0 > /sys/kernel/debug/dmatest/channel
% echo 2000 > /sys/kernel/debug/dmatest/timeout
% echo 1 > /sys/kernel/debug/dmatest/iterations
% echo 1 > /sys/kernel/debug/dmatest/run
If vruntime is counted from the creation of a process, how does such a process ever get the processor when it is competing with a newly created, processor-bound process that is younger by, let's say, days?
As I've read, the rule is simple: pick the leftmost leaf, which is the process with the lowest virtual runtime.
Thanks!
The kernel documentation for CFS kind of glosses over what would be the answer to your question, but mentions it briefly:
In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks.
So, vruntime is actually normalized. But the documentation does not go into detail.
How is it actually done?
Normalization happens by means of a min_vruntime value. This min_vruntime value is recorded in the CFS runqueue (struct cfs_rq). The min_vruntime value is the smallest vruntime of all tasks in the rbtree. The value is also used to track all the work done by the cfs_rq.
You can observe an example of normalization being performed in CFS' enqueue_entity() code:
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	/*
	 * Update the normalized vruntime before updating min_vruntime
	 * through calling update_curr().
	 */
	if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
		se->vruntime += cfs_rq->min_vruntime;

	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq);
	...
}
You can also observe in update_curr() how vruntime and min_vruntime are kept updated:
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	...

	curr->exec_start = now;
	...
	curr->sum_exec_runtime += delta_exec;
	...
	curr->vruntime += calc_delta_fair(delta_exec, curr);
	update_min_vruntime(cfs_rq);
	...
	account_cfs_rq_runtime(cfs_rq, delta_exec);
}
The actual update to min_vruntime happens in the aptly named update_min_vruntime() function:
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
	u64 vruntime = cfs_rq->min_vruntime;

	if (cfs_rq->curr)
		vruntime = cfs_rq->curr->vruntime;

	if (cfs_rq->rb_leftmost) {
		struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
						   struct sched_entity,
						   run_node);

		if (!cfs_rq->curr)
			vruntime = se->vruntime;
		else
			vruntime = min_vruntime(vruntime, se->vruntime);
	}

	/* ensure we never gain time by being placed backwards. */
	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
	...
}
By ensuring that min_vruntime is properly updated, it follows that normalization based on min_vruntime stays consistent. (You can see more examples of where normalization based on min_vruntime occurs by grepping for "normalize" or "min_vruntime" in fair.c.)
So in simple terms, all CFS tasks' vruntime values are normalized based on the current min_vruntime, which ensures that in your example, the newer task's vruntime will rapidly approach equilibrium with the older task's vruntime. (We know this because the documentation states that min_vruntime is monotonically increasing.)
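To make the effect concrete, here is a minimal userspace sketch (not kernel code) of the idea. It rests on simplifying assumptions of mine: all tasks have equal weight, a newly created task starts with its vruntime set to the runqueue's current min_vruntime, and the scheduler always runs the task with the smallest vruntime for one fixed slice. The names toy_task, toy_rq, simulate and demo_old_vs_new are hypothetical and do not exist in fair.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model of a CFS runqueue; not the kernel's structures. */
struct toy_task {
    uint64_t vruntime;  /* virtual runtime in ns */
    uint64_t ran;       /* CPU time received during the simulation, in ns */
};

struct toy_rq {
    uint64_t min_vruntime;  /* floor that only ever moves forward */
    struct toy_task *tasks;
    size_t nr;
};

/* Index of the task with the smallest vruntime (the "leftmost" task). */
static size_t pick_next(const struct toy_rq *rq)
{
    size_t best = 0;

    for (size_t i = 1; i < rq->nr; i++)
        if (rq->tasks[i].vruntime < rq->tasks[best].vruntime)
            best = i;
    return best;
}

/* Run `slices` slices of `slice_ns` each, always picking the leftmost task. */
void simulate(struct toy_rq *rq, unsigned int slices, uint64_t slice_ns)
{
    assert(rq->nr > 0);
    for (unsigned int s = 0; s < slices; s++) {
        struct toy_task *cur = &rq->tasks[pick_next(rq)];
        uint64_t leftmost;

        cur->vruntime += slice_ns;  /* equal weights: advances 1:1 */
        cur->ran += slice_ns;

        /* min_vruntime tracks the smallest vruntime, never backwards. */
        leftmost = rq->tasks[pick_next(rq)].vruntime;
        if (leftmost > rq->min_vruntime)
            rq->min_vruntime = leftmost;
    }
}

/* An "old" task with three days of vruntime competes with a new task. */
void demo_old_vs_new(uint64_t *old_ran, uint64_t *new_ran)
{
    const uint64_t three_days_ns = 3ULL * 24 * 60 * 60 * 1000000000ULL;
    struct toy_task tasks[2] = {
        { .vruntime = three_days_ns, .ran = 0 },  /* old task */
        { .vruntime = 0, .ran = 0 },              /* new task, placed below */
    };
    struct toy_rq rq = {
        .min_vruntime = three_days_ns, .tasks = tasks, .nr = 2,
    };

    /* Assumption: the new task starts at min_vruntime, not at zero. */
    tasks[1].vruntime = rq.min_vruntime;

    simulate(&rq, 1000, 1000000);  /* 1000 slices of 1 ms */
    *old_ran = tasks[0].ran;
    *new_ran = tasks[1].ran;
}
```

In this model the two tasks simply alternate, so each receives half of the simulated time. Had the new task started at vruntime 0, it would have been the leftmost task for three days' worth of slices, which is exactly the starvation the question worries about.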
When a thread on Linux is spinning while trying to acquire a spinlock, is there no chance that this thread can be preempted?
EDIT:
I just want to make sure of something. Consider a "UP" system where no interrupt handler accesses this spinlock. If the thread that is spinning while trying to get the spinlock can be preempted, then I think the critical section that the spinlock protects could call sleep in this case, since the thread holding the spinlock can be rescheduled back onto the CPU.
No, it cannot be preempted. See the code (taken from the Linux sources): http://lxr.free-electrons.com/source/include/linux/spinlock_api_smp.h?v=2.6.32#L241
241 static inline unsigned long __spin_lock_irqsave(spinlock_t *lock)
242 {
243 	unsigned long flags;
244
245 	local_irq_save(flags);
246 	preempt_disable();
247 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
248 	/*
249 	 * On lockdep we dont want the hand-coded irq-enable of
250 	 * _raw_spin_lock_flags() code, because lockdep assumes
251 	 * that interrupts are not re-enabled during lock-acquire:
252 	 */
253 #ifdef CONFIG_LOCKDEP
254 	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
255 #else
256 	_raw_spin_lock_flags(lock, &flags);
257 #endif
258 	return flags;
259 }
260
[...]
349 static inline void __spin_unlock(spinlock_t *lock)
350 {
351 	spin_release(&lock->dep_map, 1, _RET_IP_);
352 	_raw_spin_unlock(lock);
353 	preempt_enable();
354 }
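See lines 246 and 353: preempt_disable() is called before the thread starts trying to take the lock, and preempt_enable() is called only after the lock has been released.
By the way, it is generally a bad idea to sleep while holding a lock (spinlock or not).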
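The ordering in the answer above can be modelled in userspace. This is only a toy sketch under my own assumptions: toy_preempt_count is a plain global standing in for the kernel's preemption counter, and toy_spinlock, toy_spin_lock and toy_spin_unlock are hypothetical names, not kernel API. It shows nothing more than the order of operations: preemption is marked as disabled before the first spin iteration and re-enabled only after the lock is dropped.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the kernel's preemption counter. */
static int toy_preempt_count;

struct toy_spinlock {
    atomic_flag locked;
};

static void toy_preempt_disable(void) { toy_preempt_count++; }
static void toy_preempt_enable(void)  { toy_preempt_count--; }

/* Nonzero when the model would allow the current thread to be preempted. */
int toy_preemptible(void)
{
    return toy_preempt_count == 0;
}

void toy_spin_lock(struct toy_spinlock *l)
{
    /* Preemption is switched off before the first spin iteration. */
    toy_preempt_disable();
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;  /* busy-wait until the holder clears the flag */
}

void toy_spin_unlock(struct toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
    /* Only now may the thread be preempted again. */
    toy_preempt_enable();
}
```

In this model toy_preemptible() is false both while spinning and while holding the lock, which is the property the quoted kernel code provides with lines 246 and 353.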
Assume we have a blank computer without any OS and we are installing Linux. Where in the kernel is the code that identifies the processors and the cores and gets information about/from them?
This info eventually shows up in places like /proc/cpuinfo but how does the kernel get it in the first place?!
Short answer
The kernel uses the special CPU instruction cpuid and saves the results in an internal structure (cpuinfo_x86 on x86).
Long answer
The kernel source is your best friend.
Start from the entry point: the file /proc/cpuinfo.
Like any proc file, it has to be created somewhere in the kernel and declared with some file_operations. This is done in fs/proc/cpuinfo.c. The interesting piece is seq_open, which uses a reference to cpuinfo_op. These ops are declared in arch/x86/kernel/cpu/proc.c, where we see a show_cpuinfo function. This function is in the same file on line 57.
Here you can see:
64 seq_printf(m, "processor\t: %u\n"
65 "vendor_id\t: %s\n"
66 "cpu family\t: %d\n"
67 "model\t\t: %u\n"
68 "model name\t: %s\n",
69 cpu,
70 c->x86_vendor_id[0] ? c->x86_vendor_id : "unknown",
71 c->x86,
72 c->x86_model,
73 c->x86_model_id[0] ? c->x86_model_id : "unknown");
The structure c is declared on the first line of the function as struct cpuinfo_x86. This structure is declared in arch/x86/include/asm/processor.h. If you search for references to that structure, you will find the function cpu_detect, which calls the function cpuid, which finally resolves to native_cpuid. It looks like this:
static inline void native_cpuid(unsigned int *eax, unsigned int *ebx,
				unsigned int *ecx, unsigned int *edx)
{
	/* ecx is often an input as well as an output. */
	asm volatile("cpuid"
	    : "=a" (*eax),
	      "=b" (*ebx),
	      "=c" (*ecx),
	      "=d" (*edx)
	    : "0" (*eax), "2" (*ecx)
	    : "memory");
}
And here you see the assembler instruction cpuid. This little thing does the real work.
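You can try the same instruction from userspace. The sketch below is not the kernel's code path; it assumes GCC or Clang on x86 and uses the compiler-provided __get_cpuid() helper from <cpuid.h> to read leaf 0, whose EBX, EDX and ECX registers hold the vendor string. The function name read_cpu_vendor is my own.

```c
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang helper around the cpuid instruction */
#endif

/*
 * Fill `vendor` (13 bytes) with the CPUID leaf-0 vendor string.
 * Returns 1 on success, 0 if cpuid is unavailable (e.g. non-x86 builds).
 */
int read_cpu_vendor(char vendor[13])
{
    memset(vendor, 0, 13);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    /* The 12 vendor bytes are stored in EBX, EDX, ECX, in that order. */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    return 1;
#else
    return 0;
#endif
}
```

On x86 the resulting string should match the vendor_id line of /proc/cpuinfo, since the kernel fills x86_vendor_id from the same leaf.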
This information comes from the BIOS plus a hardware database. You can get it directly with dmidecode, for example (if you need more detail, check the dmidecode source code):
sudo dmidecode -t processor
I'm reading the source code for LDD3 Chapter 9, which contains an example ISA driver named silly.
The following is the initialization code for the module. What I don't understand is why there is no call to request_mem_region() before ioremap() is invoked on line 282.
268 int silly_init(void)
269 {
270 	int result = register_chrdev(silly_major, "silly", &silly_fops);
271 	if (result < 0) {
272 		printk(KERN_INFO "silly: can't get major number\n");
273 		return result;
274 	}
275 	if (silly_major == 0)
276 		silly_major = result; /* dynamic */
277 	/*
278 	 * Set up our I/O range.
279 	 */
280
281 	/* this line appears in silly_init */
282 	io_base = ioremap(ISA_BASE, ISA_MAX - ISA_BASE);
283 	return 0;
284 }
This particular driver allows accesses to all the memory in the range 0xA0000..0x100000.
If there actually are any devices in this range, then it is likely that some other driver has already reserved some of that memory, so if silly were to call request_mem_region, the call would fail, or it would be necessary to unload that other driver before loading silly.
On a PC, this range contains memory of the graphics card, and the system BIOS:
$ cat /proc/iomem
...
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cedff : Video ROM
000d0000-000dffff : PCI Bus 0000:00
000e4000-000fffff : reserved
000f0000-000fffff : System ROM
...
Unloading the graphics driver often is not possible (because it's not a module), and would prevent you from seeing what the silly driver does, and the ROM memory ranges are reserved by the kernel itself and cannot be freed.
TL;DR: Not calling request_mem_region is a particular quirk of the silly driver.
Any 'real' driver would be required to call it.
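If you want to check which parts of that window are already claimed on your own machine, a small userspace helper can filter /proc/iomem. This is only a sketch: the ISA_BASE and ISA_MAX values are taken from the range quoted above, while iomem_line_overlaps_isa and print_isa_claims are names I made up. It assumes lines of the form "start-end : name" as in the listing above, and I believe recent kernels show real addresses only to root.

```c
#include <stdio.h>

/* Window used by the silly driver, per the range quoted above. */
#define ISA_BASE 0xA0000UL
#define ISA_MAX  0x100000UL

/*
 * Parse one /proc/iomem line of the form "000a0000-000bffff : name".
 * Returns 1 if the range overlaps [ISA_BASE, ISA_MAX), 0 if it does not,
 * and -1 if the line does not look like a range at all.
 */
int iomem_line_overlaps_isa(const char *line)
{
    unsigned long start, end;

    if (sscanf(line, " %lx-%lx :", &start, &end) != 2)
        return -1;
    /* /proc/iomem ranges are inclusive at both ends. */
    return start < ISA_MAX && end >= ISA_BASE;
}

/* Print every line of `f` that overlaps the window; returns the count. */
int print_isa_claims(FILE *f)
{
    char line[256];
    int hits = 0;

    while (fgets(line, sizeof line, f)) {
        if (iomem_line_overlaps_isa(line) == 1) {
            fputs(line, stdout);
            hits++;
        }
    }
    return hits;
}
```

Calling print_isa_claims() on a FILE opened with fopen("/proc/iomem", "r") lists the regions that a request_mem_region() call for the whole window would collide with.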
Does anyone have any good explanations, tutorials, books, or guides on the use of PTRACE_SYSEMU?
What I found interesting:
Example Implementation for ptrace
Playing with ptrace, Part I - LinuxJournal.com
Playing with ptrace, Part II - LinuxJournal.com
And a programming library that makes using ptrace easier:
PinkTrace - ptrace() wrapper library.
There are examples for pinktrace, and the sydbox sources are an example of a complex pinktrace use case. In general, I've found the author to be a good person to contact about using and testing pinktrace.
There is a small test in the Linux kernel sources that uses PTRACE_SYSEMU:
http://code.metager.de/source/xref/linux/stable/tools/testing/selftests/x86/ptrace_syscall.c
or http://lxr.free-electrons.com/source/tools/testing/selftests/x86/ptrace_syscall.c
	struct user_regs_struct regs;

	printf("[RUN]\tSYSEMU\n");
	if (ptrace(PTRACE_SYSEMU, chld, 0, 0) != 0)
		err(1, "PTRACE_SYSCALL");
	wait_trap(chld);

	if (ptrace(PTRACE_GETREGS, chld, 0, &regs) != 0)
		err(1, "PTRACE_GETREGS");

	if (regs.user_syscall_nr != SYS_gettid ||
	    regs.user_arg0 != 10 || regs.user_arg1 != 11 ||
	    regs.user_arg2 != 12 || regs.user_arg3 != 13 ||
	    regs.user_arg4 != 14 || regs.user_arg5 != 15) {
		printf("[FAIL]\tInitial args are wrong (nr=%lu, args=%lu %lu %lu %lu %lu %lu)\n",
		       (unsigned long)regs.user_syscall_nr,
		       (unsigned long)regs.user_arg0,
		       (unsigned long)regs.user_arg1,
		       (unsigned long)regs.user_arg2,
		       (unsigned long)regs.user_arg3,
		       (unsigned long)regs.user_arg4,
		       (unsigned long)regs.user_arg5);
		nerrs++;
	} else {
		printf("[OK]\tInitial nr and args are correct\n");
	}

	printf("[RUN]\tRestart the syscall (ip = 0x%lx)\n",
	       (unsigned long)regs.user_ip);

	/*
	 * This does exactly what it appears to do if syscall is int80 or
	 * SYSCALL64. For SYSCALL32 or SYSENTER, though, this is highly
	 * magical. It needs to work so that ptrace and syscall restart
	 * work as expected.
	 */
	regs.user_ax = regs.user_syscall_nr;
	regs.user_ip -= 2;
	if (ptrace(PTRACE_SETREGS, chld, 0, &regs) != 0)
		err(1, "PTRACE_SETREGS");

	if (ptrace(PTRACE_SYSEMU, chld, 0, 0) != 0)
		err(1, "PTRACE_SYSCALL");
	wait_trap(chld);

	if (ptrace(PTRACE_GETREGS, chld, 0, &regs) != 0)
		err(1, "PTRACE_GETREGS");

So it looks like just another ptrace request that lets the program run until it makes its next system call, then stops the child and signals the ptracer. The ptracer can then read the registers, optionally change some of them, and restart the syscall.
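A stripped-down version of that flow, as a sketch rather than a reference implementation: it assumes x86-64 Linux with a libc that defines PTRACE_SYSEMU, and it returns -1 everywhere else or when ptrace is not permitted. The function name sysemu_first_syscall is made up for this example. As in the selftest, the child uses raw syscalls so that libc does not slip extra system calls in between.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, let it call gettid(), and intercept that call with
 * PTRACE_SYSEMU. Returns the intercepted syscall number, or -1 if the
 * platform is not x86-64 or tracing fails.
 */
long sysemu_first_syscall(void)
{
#if defined(__x86_64__) && defined(PTRACE_SYSEMU)
    struct user_regs_struct regs;
    int status;
    long nr = -1;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, 0, 0) != 0)
            _exit(1);
        /* Stop so the parent can take over, then make one syscall. */
        syscall(SYS_kill, getpid(), SIGSTOP);
        syscall(SYS_gettid);
        _exit(0);
    }

    /* Wait for the child's SIGSTOP. */
    if (waitpid(child, &status, 0) != child || !WIFSTOPPED(status))
        goto out;
    /* Resume until the next syscall entry; that syscall is not executed. */
    if (ptrace(PTRACE_SYSEMU, child, 0, 0) != 0)
        goto out;
    if (waitpid(child, &status, 0) != child || !WIFSTOPPED(status))
        goto out;
    if (ptrace(PTRACE_GETREGS, child, 0, &regs) == 0)
        nr = (long)regs.orig_rax;  /* syscall number on x86-64 */
out:
    kill(child, SIGKILL);
    waitpid(child, &status, 0);
    return nr;
#else
    return -1;
#endif
}
```

When tracing works, the returned number should equal SYS_gettid. A real emulator would at this point write its own result into the child's registers with PTRACE_SETREGS and resume it with another PTRACE_SYSEMU.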
It is implemented in http://lxr.free-electrons.com/source/kernel/ptrace.c?v=4.10#L1039 like the other resuming ptrace requests:
1039 #ifdef PTRACE_SINGLESTEP
1040 case PTRACE_SINGLESTEP:
1041 #endif
1042 #ifdef PTRACE_SINGLEBLOCK
1043 case PTRACE_SINGLEBLOCK:
1044 #endif
1045 #ifdef PTRACE_SYSEMU
1046 case PTRACE_SYSEMU:
1047 case PTRACE_SYSEMU_SINGLESTEP:
1048 #endif
1049 case PTRACE_SYSCALL:
1050 case PTRACE_CONT:
1051 return ptrace_resume(child, request, data);
And man page has some info: http://man7.org/linux/man-pages/man2/ptrace.2.html
PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP (since Linux 2.6.14)
For PTRACE_SYSEMU, continue and stop on entry to the next
system call, which will not be executed. See the
documentation on syscall-stops below. For
PTRACE_SYSEMU_SINGLESTEP, do the same but also singlestep if
not a system call. This call is used by programs like User
Mode Linux that want to emulate all the tracee's system calls.
The data argument is treated as for PTRACE_CONT. The addr
argument is ignored. These requests are currently supported
only on x86.
So it is not portable and is used only by User Mode Linux (um) on the x86 platform, as a variant of the classic PTRACE_SYSCALL. The um test for sysemu, with some comments, is here: http://lxr.free-electrons.com/source/arch/um/os-Linux/start_up.c?v=4.10#L155
155 __uml_setup("nosysemu", nosysemu_cmd_param,
156 "nosysemu\n"
157 " Turns off syscall emulation patch for ptrace (SYSEMU) on.\n"
158 " SYSEMU is a performance-patch introduced by Laurent Vivier. It changes\n"
159 " behaviour of ptrace() and helps reducing host context switch rate.\n"
160 " To make it working, you need a kernel patch for your host, too.\n"
161 " See http://perso.wanadoo.fr/laurent.vivier/UML/ for further \n"
162 " information.\n\n");
163
164 static void __init check_sysemu(void)
The link in that help text used to redirect to the site http://sysemu.sourceforge.net/ (from 2004), which explains:
Why ?
UML uses ptrace() and PTRACE_SYSCALL to catch system calls. But, by
this way, you can't remove the real system call, only monitor it. UML,
to avoid the real syscall and emulate it, replaces the real syscall by
a call to getpid(). This method generates two context-switches instead
of one.
A Solution
A solution is to change the behaviour of ptrace() to not call the real
syscall and thus we don't have to replace it by a call to getpid().
How ?
By adding a new command to ptrace(), PTRACE_SYSEMU, that acts like
PTRACE_SYSCALL without executing syscall. To add this command we need
to patch the host kernel. To use this new command in UML kernel, we
need to patch the UML kernel too.