Where and when does the Linux kernel set up the GDT? - linux

I have a question about the GDT in Linux. I am trying to read GDT information in kernel space (ring 0); my test code runs in system-call context. It prints the ss register (a segment selector) and then looks up the ss segment descriptor using GDTR and that selector.
77 void printGDTInfo(void) {
78 struct desc_ptr pgdt, *pss_desc;
79 unsigned long ssr;
80 struct desc_struct *ss_desc;
81
82 // Get GDTR
83 native_store_gdt(&pgdt);
84 unsigned long gdt_addr = pgdt.address;
85 unsigned long gdt_size = pgdt.size;
86 printk("[GDT] Addr:%lu |Size:%lu\n", gdt_addr, gdt_size);
87
88 // Get SS Register
89 asm("mov %%ss, %%eax"
90 :"=a"(ssr));
91 printk("SSR In Kernel:%lu\n", ssr);
92 unsigned long desc_index = ssr >> 3; // SHIFT for Descriptor Index
93 printk("SSR Shift:%lu\n", desc_index);
94 ss_desc = (struct desc_struct*)(gdt_addr + desc_index * sizeof(struct desc_struct));
95 printk("SSR:Base0:%u, Base1:%u,Base2:%u\n", (unsigned int)ss_desc->base0, (unsigned int)ss_desc->base1, (unsigned int)ss_desc->base2);
96 }
What confuses me most is that the "base" fields in the ss descriptor are all zero (the printk on line 95). When I print the __USER_DS segment descriptor, its "base" fields are zero as well.
Is that right? Do all segments in Linux use the same base address (zero)?
I want to check the GDT initialization in the Linux source code, but I am not sure where and when Linux sets up the GDT.
I found the following code in arch/x86/kernel/cpu/common.c. The second parameter of GDT_ENTRY_INIT is zero, which means the base0/base1/base2 fields of the segment descriptor are all zero.
[GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
[GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
If that is true, all segments have the same base address (zero). As a result, will the same virtual address in Ring0 and Ring1 map to the same linear address?
I appreciate your help :)
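As a rough check of the reasoning in the question, the packing that GDT_ENTRY_INIT performs can be reproduced in user space. The sketch below is modelled on the macro as it looked in kernels of that era (two 32-bit halves named a and b); struct seg_desc, pack_desc, desc_base and desc_limit are hypothetical names used only for illustration, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Two 32-bit halves of an 8-byte x86 segment descriptor (low half first). */
struct seg_desc {
    uint32_t a;
    uint32_t b;
};

/* Pack flags/base/limit the way the GDT_ENTRY_INIT macro is assumed to. */
static struct seg_desc pack_desc(uint32_t flags, uint32_t base, uint32_t limit)
{
    struct seg_desc d;

    assert(limit <= 0xfffff); /* the limit field is only 20 bits wide */
    d.a = (limit & 0xffff) | ((base & 0xffff) << 16);
    d.b = ((base & 0xff0000) >> 16) | ((flags & 0xf0ff) << 8) |
          (limit & 0xf0000) | (base & 0xff000000);
    return d;
}

/* Reassemble the base from base0 (16 bits), base1 (8 bits), base2 (8 bits). */
static uint32_t desc_base(struct seg_desc d)
{
    uint32_t base0 = d.a >> 16;
    uint32_t base1 = d.b & 0xff;
    uint32_t base2 = d.b >> 24;

    return base0 | (base1 << 16) | (base2 << 24);
}

/* Reassemble the 20-bit limit from its two pieces. */
static uint32_t desc_limit(struct seg_desc d)
{
    return (d.a & 0xffff) | (d.b & 0xf0000);
}
```

With a base argument of 0, as in the common.c entries quoted above, all three base fields come out as zero, so the linear address equals the offset within the segment.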

Related

Porting package to NodeJS and having trouble with Buffer.alloc()

I'm porting a game server to NodeJS. The problem is that I'm not at all familiar with packets and how to build them. After reading through the NodeJS docs, I think I've structured my response the way the client expects, but the client doesn't seem to respond well to what my server is sending.
Hoping someone can confirm whether something in my server response doesn't match the documentation requirements.
The client expects the following response from the TCP socket:
Packet Build
BYTE[1] cmd (0xA8)
BYTE[2] total length of this packet
BYTE[1] System Info Flag (0x5D)
BYTE[2] # of servers
(Repeat as needed for each server)
BYTE[2] server index (0-based)
BYTE[32] server name
BYTE percent full
BYTE timezone
BYTE[4] server IP to ping
Here is my NodeJS interpretation of the docs.
/** Build response header */
const length = 45
serverResponse = Buffer.alloc(length)
serverResponse.fill(0xA8, 0)
serverResponse.fill(Buffer.alloc(2, length), 1)
// Last fill had a buffer size of 2, so our next offset considers that
serverResponse.fill(0x5d,3)
serverResponse.fill(Buffer.alloc(2, 1),4)
/** Build response server list */
serverResponse.fill(Buffer.alloc(2, 0),5) /* 2 Bytes (server index, 0-based) */
// Last fill had a buffer size of 2, so our next offset considers that
serverResponse.fill(
Buffer.alloc(32, Buffer.from('Heres your server')),
7) /* 32 bytes (Server name) */
serverResponse.fill(Buffer.alloc(1, 9), 39) /* 1 Bytes (% Full) */
/**
* Trying -12 - 12 range divided by (60 * 60)). #see
* https://github.com/Sphereserver/Source/blob/0be2bc1d2e16659239460495b9819eb8dcfd39ed/src/graysvr/CServRef.cpp#L42
*/
serverResponse.fill(Buffer.alloc(1, -5 / (60 *60)),40) /* 1 Bytes (Timezone */
serverResponse.fill(Buffer.from([0,0,0,0]), 41) /** IP Address */
This outputs:
<Buffer a8 2d 2d 5d 01 00 00 48 65 72 65 73 20 79 6f 75 72 20 73 65 72 76 65 72 48 65 72 65 73 20 79 6f 75 72 20 73 65 72 76 09 00 00 00 00 00>
Edit: I've discovered the following that may help me along.
NodeJS equivalents
/**
* BYTE 8-bit unsigned buf.writeUInt8()
* SBYTE 8-bit signed buf.writeInt8()
* BOOL 8-bit boolean (0x00=False, 0xFF=True) buf.fill(0x00) || buf.fill(0xFF)
* CHAR 8-bit single ASCII character buf.from('Text', 'ascii') - Make 8-bit?
* UNI 16-bit single unicode character buf.from('A', 'utf16le') - Correct?
* SHORT 16-bit signed buf.writeInt16BE() - #see https://www.reddit.com/r/node/comments/9hob2u/buffer_endianness_little_endian_or_big_endian_how/
* USHORT 16-bit unsigned buf.writeUInt16BE() - #see https://www.reddit.com/r/node/comments/9hob2u/buffer_endianness_little_endian_or_big_endian_how/
* INT 32-bit signed buf.writeInt32BE - #see https://www.reddit.com/r/node/comments/9hob2u/buffer_endianness_little_endian_or_big_endian_how/
* UINT 32-bit unsigned buf.writeUInt32BE - #see https://www.reddit.com/r/node/comments/9hob2u/buffer_endianness_little_endian_or_big_endian_how/
*/

How does the Linux kernel get info about the processors and the cores?

Assume we have a blank computer without any OS and we are installing Linux. Where in the kernel is the code that identifies the processors and the cores and gets information about/from them?
This info eventually shows up in places like /proc/cpuinfo but how does the kernel get it in the first place?!
Short answer
The kernel uses the special CPU instruction cpuid and saves the results in an internal structure (cpuinfo_x86 on x86).
Long answer
The kernel source is your best friend.
Start from the entry point: the file /proc/cpuinfo.
Like any proc file, it has to be created somewhere in the kernel and declared with some file_operations. This is done in fs/proc/cpuinfo.c. The interesting piece is seq_open, which uses a reference to cpuinfo_op. These ops are declared in arch/x86/kernel/cpu/proc.c, where we see the show_cpuinfo function. This function is in the same file, on line 57.
There you can see:
seq_printf(m, "processor\t: %u\n"
           "vendor_id\t: %s\n"
           "cpu family\t: %d\n"
           "model\t\t: %u\n"
           "model name\t: %s\n",
           cpu,
           c->x86_vendor_id[0] ? c->x86_vendor_id : "unknown",
           c->x86,
           c->x86_model,
           c->x86_model_id[0] ? c->x86_model_id : "unknown");
The structure c is declared on the first line as struct cpuinfo_x86. This structure is defined in arch/x86/include/asm/processor.h. If you search for references to that structure, you will find the function cpu_detect, which calls cpuid, which finally resolves to native_cpuid:
static inline void native_cpuid(unsigned int *eax, unsigned int *ebx,
                                unsigned int *ecx, unsigned int *edx)
{
    /* ecx is often an input as well as an output. */
    asm volatile("cpuid"
        : "=a" (*eax),
          "=b" (*ebx),
          "=c" (*ecx),
          "=d" (*edx)
        : "0" (*eax), "2" (*ecx)
        : "memory");
}
And here you see the assembler instruction cpuid. This little thing does the real work.
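Once cpuid has returned, most of the remaining work is decoding bit fields. As a minimal sketch, assuming the usual layout of EAX from CPUID leaf 1 (stepping in bits 3:0, model in 7:4, family in 11:8, extended model in 19:16, extended family in 27:20), the "cpu family" and "model" values can be derived as follows. struct cpu_sig and decode_signature are hypothetical names for this example, and the signature values in the usage note are just sample inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cpu_sig {
    unsigned int family;
    unsigned int model;
    unsigned int stepping;
};

/* Decode the processor signature returned in EAX by CPUID leaf 1. */
static void decode_signature(uint32_t eax, struct cpu_sig *out)
{
    assert(out != NULL);

    out->stepping = eax & 0xf;
    out->model = (eax >> 4) & 0xf;
    out->family = (eax >> 8) & 0xf;

    /* The extended family is added only when the base family is 0xf. */
    if (out->family == 0xf)
        out->family += (eax >> 20) & 0xff;

    /* The extended model supplies the high nibble for family 6 and above. */
    if (out->family >= 0x6)
        out->model += ((eax >> 16) & 0xf) << 4;
}
```

For example, a signature of 0x000306A9 decodes to family 6, model 58, stepping 9, and 0x00800F11 decodes to family 23, model 1, stepping 1.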
This information comes from the BIOS and a hardware database. You can get it directly with dmidecode, for example (if you need more detail, check the dmidecode source code):
sudo dmidecode -t processor

PCI driver to fetch MAC address

I am trying to write a PCI driver that displays the MAC address of my Ethernet card.
I am running Ubuntu in a VM, and my Ethernet card is an Intel one:
00:08.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
I got the data sheet from the Intel website. According to it, the I/O addresses are mapped to BAR 2 (see p. 87), and the MAC can be read from the RAL/RAH registers, which are at offsets RAL (05400h + 8*n; R/W) and RAH (05404h + 8*n; R/W).
2 18h IO Register Base Address (bits 31:2) 0b mem
Based on this information, I wrote a small PCI driver, but I always get the MAC as fff, and when I debugged further, I saw that the io_base address is always zero.
Below is the code:
/*
 Program to find a device on the PCI sub-system
*/
#define VENDOR_ID 0x8086
#define DEVICE_ID 0x100e

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/stddef.h>
#include <linux/pci.h>
#include <linux/init.h>
#include <linux/cdev.h>
#include <linux/device.h>
#include <asm/io.h>

#define LOG(string...) printk(KERN_INFO string)

#define CDEV_MAJOR 227
#define CDEV_MINOR 0


MODULE_LICENSE("GPL");

struct pci_dev *pci_dev;
unsigned long mmio_addr;
unsigned long reg_len;
unsigned long *base_addr;

int device_probe(struct pci_dev *dev, const struct pci_device_id *id);
void device_remove(struct pci_dev *dev);

struct pci_device_id pci_device_id_DevicePCI[] =
{
    {VENDOR_ID, DEVICE_ID, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0},
};

struct pci_driver pci_driver_DevicePCI =
{
    name: "MyPCIDevice",
    id_table: pci_device_id_DevicePCI,
    probe: device_probe,
    remove: device_remove
};


int init_module(void)
{
    //struct pci_dev *pdev = NULL;
    int ret = 0;

    pci_register_driver(&pci_driver_DevicePCI);

    return ret;
}

void cleanup_module(void)
{
    pci_unregister_driver(&pci_driver_DevicePCI);

}

#define REGISTER_OFFSET 0x05400

int device_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
    int ret;
    int bar = 2; // Bar to be reserved
    unsigned long io_base = 0;
    unsigned long mem_len = 0;
    unsigned int register_data = 0;

    LOG("Device probed");

    /* Reserve the access to PCI device */
    ret = pci_request_region(dev, bar, "my_pci");
    if (ret) {
        printk(KERN_ERR "request region failed :%d\n", ret);
        return ret;
    }

    ret = pci_enable_device(dev);
    if (ret < 0 ) LOG("Failed while enabling ... ");

    io_base = pci_resource_start(dev, bar);
    mem_len = pci_resource_len(dev, bar);

    request_region(io_base, mem_len, "my_pci");
    register_data = inw(io_base + REGISTER_OFFSET);
    printk(KERN_INFO "IO base = %lx", io_base);
    printk(KERN_INFO "MAC = %x", register_data);

    return ret;
}

void device_remove(struct pci_dev *dev)
{
    pci_release_regions(dev);
    pci_disable_device(dev);
}

lspci -x output for my card:
00:08.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
00: 86 80 0e 10 07 00 30 02 02 00 00 02 00 40 00 00
10: 00 00 82 f0 00 00 00 00 41 d2 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 1e 00
30: 00 00 00 00 dc 00 00 00 00 00 00 00 09 01 ff 00
Can anyone tell me what I am doing wrong?
I've modified your code and commented on changes. I have removed all of your existing comments to avoid confusion, and have only modified your probe function.
/* We need a place to store the mapped address for unmapping later */
static void __iomem *logical_address;

int device_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
    int ret;
    /* the lspci dump shows BAR 0 is the memory BAR that holds the
     * register block; BAR 2 is only a small I/O port window
     */
    int requested_bar = 0;
    resource_size_t io_base = 0; /* use kernel types instead of built-in datatypes */
    resource_size_t mem_len = 0;
    u32 ral, rah;

    LOG("Device probed");

    /* switched order - enable device before requesting its regions */
    ret = pci_enable_device(dev);
    if (ret < 0) {
        LOG("Failed while enabling ... ");
        return ret;
    }

    /* pci_request_region() takes the BAR number itself, not a BAR mask */
    ret = pci_request_region(dev, requested_bar, "my_pci");
    if (ret) {
        printk(KERN_ERR "request region failed :%d\n", ret);
        pci_disable_device(dev);
        return ret;
    }

    io_base = pci_resource_start(dev, requested_bar);
    mem_len = pci_resource_len(dev, requested_bar);

    /* you don't need to request anything again, so get rid of this line: */
    /* request_region(io_base, mem_len, "my_pci"); */

    /* you're missing an important step: the bus address has to be mapped
     * into kernel virtual address space before it can be used. Add a call
     * to ioremap()
     */
    logical_address = ioremap(io_base, mem_len);
    if (!logical_address) {
        pci_release_region(dev, requested_bar);
        pci_disable_device(dev);
        return -ENOMEM;
    }

    /* use the address returned by ioremap() with the MMIO accessors;
     * inw() is only for I/O ports. RAL holds MAC bytes 0-3, RAH bytes 4-5.
     */
    ral = ioread32(logical_address + REGISTER_OFFSET);
    rah = ioread32(logical_address + REGISTER_OFFSET + 4);
    printk(KERN_INFO "IO base = %llx\n", (unsigned long long)io_base);
    printk(KERN_INFO "MAC = %02x:%02x:%02x:%02x:%02x:%02x\n",
           ral & 0xff, (ral >> 8) & 0xff, (ral >> 16) & 0xff,
           (ral >> 24) & 0xff, rah & 0xff, (rah >> 8) & 0xff);
    return 0;
}
You will need to add a corresponding call to iounmap() in your device_remove() routine. Take a look at the Intel E100E driver source code for some good examples.
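The lspci -x dump in the question already shows what kind of resource each BAR describes, and it can be decoded by hand. The sketch below is a user-space illustration that assumes 32-bit BARs only (a 64-bit memory BAR spans two slots and is not handled); struct bar_info and decode_bar are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct bar_info {
    int is_io;     /* 1 = I/O port space, 0 = memory space */
    uint32_t base; /* base address with the flag bits masked off */
};

/* Decode BAR number `bar` (0..5) from the first 64 bytes of PCI
 * configuration space, as printed by `lspci -x`.
 */
static struct bar_info decode_bar(const uint8_t *cfg, int bar)
{
    struct bar_info info;
    size_t off;
    uint32_t raw;

    assert(cfg != NULL && bar >= 0 && bar <= 5);

    /* BARs start at offset 0x10 and are stored little-endian. */
    off = 0x10 + 4 * (size_t)bar;
    raw = (uint32_t)cfg[off] |
          ((uint32_t)cfg[off + 1] << 8) |
          ((uint32_t)cfg[off + 2] << 16) |
          ((uint32_t)cfg[off + 3] << 24);

    /* Bit 0 selects the address space; I/O BARs use bits 31:2 for the
     * base, memory BARs use bits 31:4.
     */
    info.is_io = (int)(raw & 1u);
    info.base = info.is_io ? (raw & ~0x3u) : (raw & ~0xfu);
    return info;
}
```

Applied to the dump above, the bytes at 0x10 (00 00 82 f0) decode to a memory BAR at 0xf0820000, while the bytes at 0x18 (41 d2 00 00) decode to an I/O port BAR at 0xd240. That is why BAR 2 is accessed with port I/O, whereas ioremap() applies to the memory BAR.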

Race condition on ticket-based ARM spinlock

I found that spinlocks in the Linux kernel are all "ticket-based" now. However, after looking at the ARM implementation, I'm confused because the "load-add-store" operation does not look atomic at all. Please see the code below:
74 static inline void arch_spin_lock(arch_spinlock_t *lock)
75 {
76 unsigned long tmp;
77 u32 newval;
78 arch_spinlock_t lockval;
79
80 __asm__ __volatile__(
81 "1: ldrex %0, [%3]\n" /*Why this load-add-store is not atomic?*/
82 " add %1, %0, %4\n"
83 " strex %2, %1, [%3]\n"
84 " teq %2, #0\n"
85 " bne 1b"
86 : "=&r" (lockval), "=&r" (newval), "=&r" (tmp)
87 : "r" (&lock->slock), "I" (1 << TICKET_SHIFT)
88 : "cc");
89
90 while (lockval.tickets.next != lockval.tickets.owner) {
91 wfe();
92 lockval.tickets.owner = ACCESS_ONCE(lock->tickets.owner);
93 }
94
95 smp_mb();
96 }
As you can see, lines 81~83 load lock->slock into "lockval", increment it, and store it back to lock->slock.
However, I don't see anything that ensures this is atomic. So it seems possible that:
Two users on different CPUs read lock->slock into their own "lockval" at the same time; then each increments its "lockval" and stores it back.
This would leave both users holding the same "number", and once the "owner" field reaches that number, both of them would acquire the lock and operate on shared resources!
I don't think the kernel can have such a bug in its spinlock. Where am I wrong?
STREX is a conditional store; this code has Load-Link/Store-Conditional semantics, even if ARM doesn't use that name.
The operation either completes atomically or fails.
The assembler block tests for failure (the tmp variable indicates failure) and reattempts the modification, using the new value (updated by another core).
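The same retry pattern can be imitated in portable C11. This is only an analogy: it assumes a compare-and-swap loop in place of LDREX/STREX, and ticket_lock_t, take_ticket, ticket_lock and ticket_unlock are hypothetical names, not kernel APIs. The layout (owner in the low 16 bits, next in the high 16 bits) follows the little-endian case of the code in the question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TICKET_SHIFT 16

typedef struct {
    _Atomic uint32_t slock; /* low 16 bits: owner, high 16 bits: next */
} ticket_lock_t;

/* Atomically take the next ticket; retry if another thread got in first. */
static uint16_t take_ticket(ticket_lock_t *lock)
{
    uint32_t old;

    assert(lock != NULL);
    old = atomic_load_explicit(&lock->slock, memory_order_relaxed);

    /* On failure `old` is refreshed with the current value, much like
     * the branch back to the ldrex when strex reports failure.
     */
    while (!atomic_compare_exchange_weak_explicit(
               &lock->slock, &old, old + (1u << TICKET_SHIFT),
               memory_order_relaxed, memory_order_relaxed))
        ;

    return (uint16_t)(old >> TICKET_SHIFT);
}

static void ticket_lock(ticket_lock_t *lock)
{
    uint16_t mine = take_ticket(lock);

    /* Spin until the owner field reaches our ticket number. */
    while ((uint16_t)atomic_load_explicit(&lock->slock,
                                          memory_order_acquire) != mine)
        ;
}

static void ticket_unlock(ticket_lock_t *lock)
{
    uint32_t old = atomic_load_explicit(&lock->slock, memory_order_relaxed);
    uint32_t updated;

    /* Advance owner without letting a wrap carry into the next field. */
    do {
        updated = (old & 0xffff0000u) | ((old + 1) & 0xffffu);
    } while (!atomic_compare_exchange_weak_explicit(
                 &lock->slock, &old, updated,
                 memory_order_release, memory_order_relaxed));
}
```

Two callers of take_ticket can never receive the same number: whichever update lands second fails its compare-and-swap, reloads the value written by the first, and retries with it.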

linux kernel ip_options_build() function

Below is ip_options_build() from Linux kernel 3.4; see lines 51 and 52:
51 if (opt->srr)
52 memcpy(iph+opt->srr+iph[opt->srr+1]-4, &daddr, 4);
I understand that these two lines say: if the source routing option is present, copy the destination address to the end of the option. That suggests iph[opt->srr+1] is the length of the source routing option, but I don't understand why.
31 /*
32 * Write options to IP header, record destination address to
33 * source route option, address of outgoing interface
34 * (we should already know it, so that this function is allowed be
35 * called only after routing decision) and timestamp,
36 * if we originate this datagram.
37 *
38 * daddr is real destination address, next hop is recorded in IP header.
39 * saddr is address of outgoing interface.
40 */
41
42 void ip_options_build(struct sk_buff *skb, struct ip_options *opt,
43 __be32 daddr, struct rtable *rt, int is_frag)
44 {
45 unsigned char *iph = skb_network_header(skb);
46
47 memcpy(&(IPCB(skb)->opt), opt, sizeof(struct ip_options));
48 memcpy(iph+sizeof(struct iphdr), opt->__data, opt->optlen);
49 opt = &(IPCB(skb)->opt);
50
51 if (opt->srr)
52 memcpy(iph+opt->srr+iph[opt->srr+1]-4, &daddr, 4);
53
54 if (!is_frag) {
55 if (opt->rr_needaddr)
56 ip_rt_get_source(iph+opt->rr+iph[opt->rr+2]-5, skb, rt);
57 if (opt->ts_needaddr)
58 ip_rt_get_source(iph+opt->ts+iph[opt->ts+2]-9, skb, rt);
59 if (opt->ts_needtime) {
60 struct timespec tv;
61 __be32 midtime;
62 getnstimeofday(&tv);
63 midtime = htonl((tv.tv_sec % 86400) * MSEC_PER_SEC
+ tv.tv_nsec / NSEC_PER_MSEC);
64 memcpy(iph+opt->ts+iph[opt->ts+2]-5, &midtime, 4);
65 }
66 return;
67 }
68 if (opt->rr) {
69 memset(iph+opt->rr, IPOPT_NOP, iph[opt->rr+1]);
70 opt->rr = 0;
71 opt->rr_needaddr = 0;
72 }
73 if (opt->ts) {
74 memset(iph+opt->ts, IPOPT_NOP, iph[opt->ts+1]);
75 opt->ts = 0;
76 opt->ts_needaddr = opt->ts_needtime = 0;
77 }
78 }
If I remember correctly, iph + opt->srr is basically the address of the first byte of the srr option. The format of the option itself is as follows:
TYPE (1 byte) | LENGTH (1 byte) | OFFSET (1 byte) | ... and then some addresses 4 bytes each
The LENGTH "field" specifies the length in bytes of the entire option, so that's why iph[opt->srr+1] is the length of the option.
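The pointer arithmetic can be tried out on a plain byte buffer. The sketch below is a user-space illustration under the option layout described above; IP_HDR_LEN and srr_store_daddr are hypothetical names, and srr is taken to be the offset of the option's TYPE byte from the start of the IP header, as in the kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IP_HDR_LEN 20 /* IPv4 header without options */

/* Copy daddr into the last 4-byte address slot of the source-route option.
 * iph[srr] is TYPE, iph[srr + 1] is LENGTH, iph[srr + 2] is the pointer.
 */
static void srr_store_daddr(unsigned char *iph, unsigned int srr,
                            const unsigned char daddr[4])
{
    unsigned int optlen;

    assert(iph != NULL && srr >= IP_HDR_LEN);
    optlen = iph[srr + 1];
    assert(optlen >= 3 + 4); /* 3 header bytes plus at least one address */

    /* iph + srr + optlen is one past the option, so subtracting 4
     * lands on its final address slot.
     */
    memcpy(iph + srr + optlen - 4, daddr, 4);
}
```

For an option that starts right after a 20-byte header and has LENGTH 11 (3 header bytes plus two addresses), the destination lands at byte offset 20 + 11 - 4 = 27, which is the second address slot.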
