How does the Linux kernel get info about the processors and the cores? - linux

Assume we have a blank computer without any OS and we are installing a Linux. Where in the kernel is the code that identifies the processors and the cores and get information about/from them?
This info eventually shows up in places like /proc/cpuinfo but how does the kernel get it in the first place?!

Short answer
Kernel uses special CPU instruction cpuid and saves results in internal structure - cpuinfo_x86 for x86
Long answer
Kernel source is your best friend.
Start from entry point - file /proc/cpuinfo.
As any proc file it has to be cretaed somewhere in kernel and declared with some file_operations. This is done at fs/proc/cpuinfo.c. Interesting piece is seq_open that uses reference to some cpuinfo_op. This ops are declared in arch/x86/kernel/cpu/proc.c where we see some show_cpuinfo function. This function is in the same file on line 57.
Here you can see
64 seq_printf(m, "processor\t: %u\n"
65 "vendor_id\t: %s\n"
66 "cpu family\t: %d\n"
67 "model\t\t: %u\n"
68 "model name\t: %s\n",
69 cpu,
70 c->x86_vendor_id[0] ? c->x86_vendor_id : "unknown",
71 c->x86,
72 c->x86_model,
73 c->x86_model_id[0] ? c->x86_model_id : "unknown");
Structure c declared on the first line as struct cpuinfo_x86. This structure is declared in arch/x86/include/asm/processor.h. And if you search for references on that structure you will find function cpu_detect and that function calls function cpuid which is finally resolved to native_cpuid that looks like this:
189 static inline void native_cpuid(unsigned int *eax, unsigned int *ebx,
190 unsigned int *ecx, unsigned int *edx)
191 {
192 /* ecx is often an input as well as an output. */
193 asm volatile("cpuid"
194 : "=a" (*eax),
195 "=b" (*ebx),
196 "=c" (*ecx),
197 "=d" (*edx)
198 : "" (*eax), "2" (*ecx)
199 : "memory");
200 }
And here you see assembler instruction cpuid. And this little thing does real work.

This information from BIOS + Hardware DB. You can get info direct by dmidecode, for example (if you need more info - try to check dmidecode source code)
sudo dmidecode -t processor

Related

Where and When Linux Kernel Setup GDT?

I have some doubt regarding GDT in linux. I try to get GDT info in kernel space (Ring0), and my test code called in system call context. In the test code, I try to print ss register (Segment Selector), and get ss segment descriptor by GDTR and ss-segment-selector.
77 void printGDTInfo(void) {
78 struct desc_ptr pgdt, *pss_desc;
79 unsigned long ssr;
80 struct desc_struct *ss_desc;
81
82 // Get GDTR
83 native_store_gdt(&pgdt);
84 unsigned long gdt_addr = pgdt.address;
85 unsigned long gdt_size = pgdt.size;
86 printk("[GDT] Addr:%lu |Size:%lu\n", gdt_addr, gdt_size);
87
88 // Get SS Register
89 asm("mov %%ss, %%eax"
90 :"=a"(ssr));
91 printk("SSR In Kernel:%lu\n", ssr);
92 unsigned long desc_index = ssr >> 3; // SHIFT for Descriptor Index
93 printk("SSR Shift:%lu\n", desc_index);
94 ss_desc = (struct desc_struct*)(gdt_addr + desc_index * sizeof(struct desc_struct));
95 printk("SSR:Base0:%lu, Base1:%lu,Base2:%lu\n", ss_desc->base0, ss_desc->base1, ss_desc->base2);
96 }
What confused me most is the "base" fields in ss-descriptor are all zero (line95 print). I try to print __USER_DS segment descriptor, the "base" fields are also zero.
Is that true? All the segment in Linux use same Base Address(zero)?
I want to check the GDT initialization in Linux Source Code but I am not sure where and when Linux setup GDT?
I find codes in "arch/x86/kernel/cpu/common.c"like this, the second parameter(zero) of GDT_ENTRY_INIT is zero, which means the base0/base1/base2 fields in segment descriptor are all zero.
125 [GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
126 [GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
127 [GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
If that is true, all the segment has same base address(zero). As a result, same virtual address in Ring0 and Ring1 will mapping to the same linear address?
I am appreciate for your help :)

How to choose static IO memory map for virtual memory on ARM

I am investigating how to port the Linux kernel to a new ARM platform.
I noticed that some platform implementations have static mapping from a physical IO address to a virtual address in map_io function.
My question is how should I decide the "virtual" address in structure map_desc? Can I map a physical IO to arbitrary virtual memory? or are there some rules or good practice about it? I checked http://lxr.free-electrons.com/source/Documentation/arm/memory.txt , but did not find any answers.
Here are some examples of map_desc and map_io:
http://lxr.free-electrons.com/source/arch/arm/mach-versatile/versatile_dt.c#L45
44 DT_MACHINE_START(VERSATILE_PB, "ARM-Versatile (Device Tree Support)")
45 .map_io = versatile_map_io,
46 .init_early = versatile_init_early,
47 .init_machine = versatile_dt_init,
48 .dt_compat = versatile_dt_match,
49 .restart = versatile_restart,
50 MACHINE_END
http://lxr.free-electrons.com/source/arch/arm/mach-versatile/core.c#L189
189 void __init versatile_map_io(void)
190 {
191 iotable_init(versatile_io_desc, ARRAY_SIZE(versatile_io_desc));
192 }
131 static struct map_desc versatile_io_desc[] __initdata __maybe_unused = {
132 {
133 .virtual = IO_ADDRESS(VERSATILE_SYS_BASE),
134 .pfn = __phys_to_pfn(VERSATILE_SYS_BASE),
135 .length = SZ_4K,
136 .type = MT_DEVICE
137 }, {
too long for a comment...
Not an expert but, since map_desc is for static mappings. It should be from system manuals. virtual is how the peripheral can be accessed from kernel virtual space, pfn (page frame number) is physical address by page units.
Thing is if you are in kernel space, you are using kernel virtual space mappings so even you want to access a certain physical address you need to have a mapping for that which can be one-to-one which I believe what you get out of map_desc.
Static mapping is map_desc, dynamic mapping is ioremap. So if you want to play with physical IO, ioremap is first thing if that doesn't work then for special cases map_desc.
DMA-API-HOWTO provides a good entry point to different kind of addresses mappings in Linux.

Race condition on ticket-based ARM spinlock

I found that spinlocks in Linux kernel are all using "ticket-based" spinlock now. However after looking at the ARM implementation of it, I'm confused because the "load-add-store" operation is not atomic at all. Please see the code below:
74 static inline void arch_spin_lock(arch_spinlock_t *lock)
75 {
76 unsigned long tmp;
77 u32 newval;
78 arch_spinlock_t lockval;
79
80 __asm__ __volatile__(
81 "1: ldrex %0, [%3]\n" /*Why this load-add-store is not atomic?*/
82 " add %1, %0, %4\n"
83 " strex %2, %1, [%3]\n"
84 " teq %2, #0\n"
85 " bne 1b"
86 : "=&r" (lockval), "=&r" (newval), "=&r" (tmp)
87 : "r" (&lock->slock), "I" (1 << TICKET_SHIFT)
88 : "cc");
89
90 while (lockval.tickets.next != lockval.tickets.owner) {
91 wfe();
92 lockval.tickets.owner = ACCESS_ONCE(lock->tickets.owner);
93 }
94
95 smp_mb();
96 }
As you can see, on line 81~83 it loads lock->slock to "lockval" and increment it by one and then store it back to the lock->slock.
However I didn't see anywhere this is ensured to be atomic. So it could be possible that:
Two users on different cpu are reading lock->slock to their own variable "lockval" at the same time; Then they add "lockval" by one respectively and then store it back.
This will cause these two users are having the same "number" in hand and once the "owner" field becomes that number, both of them will acquire the lock and do operations on some shared-resources!
I don't think kernel can have such a bug in spinlock. Am I wrong somewhere?
STREX is a conditional store, this code has Load Link-Store Conditional semantics, even if ARM doesn't use that name.
The operation either completes atomically, or fails.
The assembler block tests for failure (the tmp variable indicates failure) and reattempts the modification, using the new value (updated by another core).

Why ISA doesn't need request_mem_region

I'm reading the source code of LDD3 Chapter 9. And there's an example for ISA driver named silly.
The following is initialization for the module. What I don't understand is why there's no call for "request_mem_region()" before invocation for ioremap() in line 282
268 int silly_init(void)
269 {
270 int result = register_chrdev(silly_major, "silly", &silly_fops);
271 if (result < 0) {
272 printk(KERN_INFO "silly: can't get major number\n");
273 return result;
274 }
275 if (silly_major == 0)
276 silly_major = result; /* dynamic */
277 /*
278 * Set up our I/O range.
279 */
280
281 /* this line appears in silly_init */
282 io_base = ioremap(ISA_BASE, ISA_MAX - ISA_BASE);
283 return 0;
284 }
This particular driver allows accesses to all the memory in the range 0xA0000..0x100000.
If there actually are any devices in this range, then it is likely that some other driver already has reserved some of that memory, so if silly were try to call request_mem_region, it would fail, or it would be necessary to unload that other driver before loading silly.
On a PC, this range contains memory of the graphics card, and the system BIOS:
$ cat /proc/iomem
...
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cedff : Video ROM
000d0000-000dffff : PCI Bus 0000:00
000e4000-000fffff : reserved
000f0000-000fffff : System ROM
...
Unloading the graphics driver often is not possible (because it's not a module), and would prevent you from seeing what the silly driver does, and the ROM memory ranges are reserved by the kernel itself and cannot be freed.
TL;DR: Not calling request_mem_region is a particular quirk of the silly driver.
Any 'real' driver would be required to call it.

Any good guides on using PTRACE_SYSEMU?

Does anyone have any good explanations, tutorials, books, or guides on the use of PTRACE_SYSEMU?
What I found interesting:
Example Implementation for ptrace
Playing with ptrace, Part I - LinuxJournal.com
Playing with ptrace, Part II - LinuxJournal.com
And programming library that makes using ptrace easier :
PinkTrace - ptrace() wrapper library.
For pinktrace there are examples, sydbox sources are example of complex pinktrace usecase. In general, I've found author as good person to contact about using and testing pinktrace.
There is small test from linux kernel sources which uses PTRACE_SYSEMU:
http://code.metager.de/source/xref/linux/stable/tools/testing/selftests/x86/ptrace_syscall.c
or http://lxr.free-electrons.com/source/tools/testing/selftests/x86/ptrace_syscall.c
186 struct user_regs_struct regs;
187
188 printf("[RUN]\tSYSEMU\n");
189 if (ptrace(PTRACE_SYSEMU, chld, 0, 0) != 0)
190 err(1, "PTRACE_SYSCALL");
191 wait_trap(chld);
192
193 if (ptrace(PTRACE_GETREGS, chld, 0, &regs) != 0)
194 err(1, "PTRACE_GETREGS");
195
196 if (regs.user_syscall_nr != SYS_gettid ||
197 regs.user_arg0 != 10 || regs.user_arg1 != 11 ||
198 regs.user_arg2 != 12 || regs.user_arg3 != 13 ||
199 regs.user_arg4 != 14 || regs.user_arg5 != 15) {
200 printf("[FAIL]\tInitial args are wrong (nr=%lu, args=%lu %lu %lu %lu %lu %lu)\n", (unsigned long)regs.user_syscall_nr, (unsigned long)regs.user_arg0, (unsigned long)regs.user_arg1, (unsigned long)regs.user_arg2, (unsigned long)regs.user_arg3, (unsigned long)regs.user_arg4, (unsigned long)regs.user_arg5);
201 nerrs++;
202 } else {
203 printf("[OK]\tInitial nr and args are correct\n");
204 }
205
206 printf("[RUN]\tRestart the syscall (ip = 0x%lx)\n",
207 (unsigned long)regs.user_ip);
208
209 /*
210 * This does exactly what it appears to do if syscall is int80 or
211 * SYSCALL64. For SYSCALL32 or SYSENTER, though, this is highly
212 * magical. It needs to work so that ptrace and syscall restart
213 * work as expected.
214 */
215 regs.user_ax = regs.user_syscall_nr;
216 regs.user_ip -= 2;
217 if (ptrace(PTRACE_SETREGS, chld, 0, &regs) != 0)
218 err(1, "PTRACE_SETREGS");
219
220 if (ptrace(PTRACE_SYSEMU, chld, 0, 0) != 0)
221 err(1, "PTRACE_SYSCALL");
222 wait_trap(chld);
223
224 if (ptrace(PTRACE_GETREGS, chld, 0, &regs) != 0)
225 err(1, "PTRACE_GETREGS");
226
So, it looks like just another ptrace call which will allow program to run until next system call is made by it; then stop child and signal the ptracer. It can read registers, optionally change some and restart the syscall.
Implemented in http://lxr.free-electrons.com/source/kernel/ptrace.c?v=4.10#L1039 like other stepping ptrace calls:
1039 #ifdef PTRACE_SINGLESTEP
1040 case PTRACE_SINGLESTEP:
1041 #endif
1042 #ifdef PTRACE_SINGLEBLOCK
1043 case PTRACE_SINGLEBLOCK:
1044 #endif
1045 #ifdef PTRACE_SYSEMU
1046 case PTRACE_SYSEMU:
1047 case PTRACE_SYSEMU_SINGLESTEP:
1048 #endif
1049 case PTRACE_SYSCALL:
1050 case PTRACE_CONT:
1051 return ptrace_resume(child, request, data);
And man page has some info: http://man7.org/linux/man-pages/man2/ptrace.2.html
PTRACE_SYSEMU, PTRACE_SYSEMU_SINGLESTEP (since Linux 2.6.14)
For PTRACE_SYSEMU, continue and stop on entry to the next
system call, which will not be executed. See the
documentation on syscall-stops below. For
PTRACE_SYSEMU_SINGLESTEP, do the same but also singlestep if
not a system call. This call is used by programs like User
Mode Linux that want to emulate all the tracee's system calls.
The data argument is treated as for PTRACE_CONT. The addr
argument is ignored. These requests are currently supported
only on x86.
So, it is not portable and used only for Usermode linux (um) on x86 platform as variant of classic PTRACE_SYSCALL. And um test for sysemu with some comments is here: http://lxr.free-electrons.com/source/arch/um/os-Linux/start_up.c?v=4.10#L155
155 __uml_setup("nosysemu", nosysemu_cmd_param,
156 "nosysemu\n"
157 " Turns off syscall emulation patch for ptrace (SYSEMU) on.\n"
158 " SYSEMU is a performance-patch introduced by Laurent Vivier. It changes\n"
159 " behaviour of ptrace() and helps reducing host context switch rate.\n"
160 " To make it working, you need a kernel patch for your host, too.\n"
161 " See http://perso.wanadoo.fr/laurent.vivier/UML/ for further \n"
162 " information.\n\n");
163
164 static void __init check_sysemu(void)
Link in comment was redirecting to secret site http://sysemu.sourceforge.net/ from 2004:
Why ?
UML uses ptrace() and PTRACE_SYSCALL to catch system calls. But, by
this way, you can't remove the real system call, only monitor it. UML,
to avoid the real syscall and emulate it, replaces the real syscall by
a call to getpid(). This method generates two context-switches instead
of one. A Solution
A solution is to change the behaviour of ptrace() to not call the real
syscall and thus we don't have to replace it by a call to getpid().
How ?
By adding a new command to ptrace(), PTRACE_SYSEMU, that acts like
PTRACE_SYSCALL without executing syscall. To add this command we need
to patch the host kernel. To use this new command in UML kernel, we
need to patch the UML kernel too.

Resources