In a dual-socket system -- Linux using CPU socket 1 and some other program running on CPU socket 2 -- how do you prevent memory region overlaps?

I'm reading about operating system privilege levels and how the CPU's MMU, in cooperation with the OS, prevents invalid memory access.
I think I understand how that works, but I fail to understand how it would work if there is another non-OS program running on socket 2 while Linux is running only on socket 1.
My question is as follows:
Is it possible to run this configuration -- Linux is given control of only CPU socket 1 when the OS boots, while a different OS (or program) takes control of CPU socket 2?
If yes, then how do you prevent these two OSs (or Linux and the other program) from stomping on overlapping regions of memory?
How does privilege level security work in this case?

Related

I/O instructions by user-mode process

There are two ways to access hardware:
by memory-mapped I/O (MMIO)
by I/O ports
If a user-mode process wants to access I/O directly, without using system calls, and it knows about a particular piece of hardware, it can't access it through memory-mapped I/O, since that memory is not in its address space, so a segmentation fault will occur. But what about using I/O ports to do so? Since I/O ports are not in the memory address space, but are accessed by instructions executed directly by the processor, what will happen? Can the process access them or not?
Short version
No. For example, in the x86 architecture, in and out are privileged instructions that can only run in Ring 0. Since user-mode applications run at Ring 3, the CPU will fault when they try to execute those instructions.
Long version
In general, in an OS that enforces memory protection and process separation (such as Windows, Linux, or Mac OS, but not DOS), user-mode applications don't have direct access to the hardware; they use system calls to ask the kernel to perform the required actions on their behalf. The kernel can then decide whether the application has the right to do said action, whether the action is safe to do, etc.
If a user-mode application is able to do anything that can normally only be done by the kernel, it is a bug/backdoor and a huge security vulnerability in the kernel (or sometimes, the CPU itself).
On the x86 architecture, I/O port instructions are privileged in all operating modes except Real Mode. An attempt to execute any privileged instruction will generate a fault (specifically, #GP, General-Protection Fault, which corresponds to INT 0xD).
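To see what that looks like from a Linux process's point of view, here is a minimal sketch of my own (assuming x86 Linux, compiled as an ordinary user program): the outb below triggers the #GP described above, which the kernel delivers to the process as SIGSEGV.

/* Sketch: unprivileged port write on x86 Linux -- expect SIGSEGV (#GP). */
#include <stdio.h>

int main(void)
{
    unsigned char value = 0xFF;
    unsigned short port = 0x80;   /* 0x80 is the traditional POST/debug port */

    /* Privileged at Ring 3: faults unless the OS has granted port access. */
    __asm__ volatile ("outb %0, %1" : : "a"(value), "Nd"(port));

    puts("outb succeeded (only possible if the OS granted port access)");
    return 0;
}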
In order to allow processes like user-mode drivers to work, the OS is able to override this privileged status of I/O instructions in two ways (a Linux-flavoured sketch follows this list):
Use the IOPL (I/O privilege level) field in the FLAGS register (bits 12 and 13). This is a number from 0 to 3 that specifies the least privileged ring that can run I/O instructions. Setting IOPL to 3 will allow ALL processes to access ANY I/O port, which is insecure and defeats the purpose of protecting everything else like memory. For this reason, operating systems usually set this field to 0.
Use the IOPB (I/O Permissions Bitmap) bitfield in the Task State Segment (TSS) structure. This is simply a collection of 65536 bits, each specifying if an I/O port can be accessed (0) or not (1). If a user-mode driver needs to access, say, port 0xAB, then the OS can clear the corresponding bit in the TSS and allow the user-mode code to access only that port, even though IOPL=0. If the operating system doesn't include all 65536 bits, those missing are assumed by the CPU to be privileged and thus inaccessible.
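On Linux, these two mechanisms are exposed to privileged user space through the iopl(2) and ioperm(2) system calls; ioperm() is the one that edits the IOPB. A minimal sketch, assuming an x86 Linux machine and root (CAP_SYS_RAWIO), using the CMOS/RTC ports purely as an arbitrary example:

#include <stdio.h>
#include <sys/io.h>     /* ioperm(), inb(), outb() on x86 Linux */

int main(void)
{
    /* Ask the kernel to clear the IOPB bits for ports 0x70-0x71 (CMOS/RTC
     * index/data ports) for this process. Requires CAP_SYS_RAWIO. */
    if (ioperm(0x70, 2, 1) != 0) {
        perror("ioperm");
        return 1;
    }

    /* With those bitmap bits cleared, in/out on these two ports no longer
     * faults, even though IOPL is still 0. */
    outb(0x00, 0x70);                      /* select CMOS register 0 (RTC seconds) */
    unsigned char seconds = inb(0x71);
    printf("raw RTC seconds register: 0x%02x\n", seconds);

    ioperm(0x70, 2, 0);                    /* drop the permission again */
    return 0;
}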

Impact of kernel on system call

I have two different machines:
Machine 1:
# uname
Linux
# uname -r
2.6.34.15-WR4.3.fp_x86_64_standard-00239-g7934205
Machine 2:
# uname
Linux
# uname -r
4.4.217-pc64-distro.git-v2.102-3-rc
A process "X" is having higher CPU in Machine 2 than in machine 1.
Process X is having exactly same code on both these machines. The load on the process is also same.
The question here is system calls being used here can have different CPU impacts as the kernel is different in both these machines?
P.S: Machine 2 is having some customization in OS as well.
Is there any way to check CPU usage by system call?
OProfile is not running on this machine; maybe it is not installed. I don't have permission to install it.
Any other tool that can help?
Also, can anyone help me understand whether the TCP send and receive buffers have any impact on the CPU usage of a process?
The short answer is: yes. The CPU load of a process consists of its user-space processing (i.e. the process code) and its kernel-space, a.k.a. "system", time (i.e. time spent in system calls). Because system call implementations may have changed between kernels, it is possible that this led to more processing in kernel space.
But there's a longer answer, and it would depend on which system calls these are, and whether it's the direct system call implementation that changed or a kernel subsystem (e.g. the I/O scheduler?), as well as on those customizations you've mentioned.
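If OProfile is unavailable, one crude check that needs no extra tools (a sketch of my own, not part of the answer above) is to have process X report its own split between user and system CPU time via getrusage(2); if ru_stime is much larger on Machine 2, the extra cost is indeed on the system-call side. If strace happens to be present, strace -c -p <pid> also prints a per-system-call time summary.

#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Print how much CPU time this process has spent in user space vs. in the
 * kernel (system calls). Call it periodically from the process under test. */
static void report_cpu_split(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return;
    }

    printf("user   %ld.%06ld s\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
    printf("system %ld.%06ld s\n",
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
}

int main(void)
{
    report_cpu_split();
    return 0;
}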

Is it possible to circumvent OS security by not using the supplied System Calls?

I understand that an operating system enforces security policies on users when they use the system and filesystem, via the system calls supplied by said OS.
Is it possible to circumvent this security by issuing your own hardware instructions instead of making use of the supplied system call interface of the OS? Even writing a single bit to a file you normally have no access to would be enough.
First, for simplicity, I'm considering the OS and the kernel to be the same thing.
A CPU can be in different modes when executing code.
Let's say a hypothetical CPU has just two modes of execution (Supervisor and User).
When in Supervisor mode, you are allowed to execute any instructions, and you have full access to the hardware resources.
When in User mode, there is a subset of instructions you don't have access to, such as instructions to deal with hardware or to change the CPU mode. Trying to execute one of those instructions will cause the OS to be notified that your application is misbehaving, and it will be terminated. This notification is done through interrupts. Also, when in User mode, you only have access to a portion of the memory, so your application can't even touch memory it is not supposed to.
Now, the trick for this to work is that while in Supervisor Mode, you can switch to User Mode, since it's a less privileged mode, but while in User Mode, you can't go back to Supervisor Mode, since the instructions for that are not permitted anymore.
The only way to go back to Supervisor mode is through system calls, or interrupts. That enables the OS to have full control of the hardware.
A possible example of how everything fits together for this hypothetical CPU:
The CPU boots in Supervisor mode
Since the CPU starts in Supervisor Mode, the first thing to run has access to the full system. This is the OS.
The OS sets up the hardware any way it wants: memory protection, etc.
The OS launches any application you want after configuring permissions for that application. Launching the application switches to User Mode.
The application is running, and only has access to the resources the OS allowed when launching it. Any access to hardware resources needs to go through system calls.
I've only explained the flow for a single application.
As a bonus to help you understand how this fits together with several applications running, a simplified view of how preemptive multitasking works:
In a real-world situation, the OS will set up a hardware timer before launching any applications.
When this timer expires, it causes the CPU to interrupt whatever it was doing (e.g. running an application), switch to Supervisor Mode, and execute code at a predetermined location, which belongs to the OS and which applications don't have access to.
Since we're back in Supervisor Mode and running OS code, the OS now picks the next application to run, sets up any required permissions, switches to User Mode and resumes that application.
These timer interrupts are how you get the illusion of multitasking. The OS keeps switching between applications quickly.
The bottom line here is that unless there are bugs in the OS (or the hardware design), the only way an application can go from User Mode to Supervisor Mode is through the OS itself with a System Call.
This is the mechanism I use in my hobby project (a virtual computer) https://github.com/ruifig/G4DevKit.
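As a small side illustration of that bottom line (my own sketch, assuming Linux; not part of this answer): even when a program bypasses the libc wrappers and traps into the kernel "directly" with syscall(2), it only ever reaches the kernel's fixed entry point, where the usual permission checks still run.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    /* Issue openat(2) ourselves instead of calling the libc wrapper. The trap
     * still lands on the kernel's entry point, which checks file permissions
     * before doing anything; as an unprivileged user this is refused. */
    long fd = syscall(SYS_openat, AT_FDCWD, "/etc/shadow", O_RDONLY);

    if (fd < 0)
        printf("open refused: %s\n", strerror(errno));   /* typically EACCES */
    else
        close((int)fd);

    return 0;
}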
Hardware devices are connected to the CPU through a bus. The CPU communicates with them either by using in/out instructions to read/write values at I/O ports (not used much with current hardware; in the early age of home computers this was the common way), or by having a part of the device memory "mapped" into the CPU address space, so the CPU controls the device by writing values at defined locations in that shared memory.
None of this should be accessible from the "user level" context in which the OS executes ordinary applications (so an application trying to write to that shared device memory would crash with an illegal memory access; actually, that piece of memory is usually not even mapped into user space, i.e. it doesn't exist from the user application's point of view). Direct in/out instructions are blocked at the CPU level too.
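To make the "not even mapped into user space" point concrete: on Linux, the only sanctioned way for user code to see device/physical memory is to ask the kernel to map it, e.g. through /dev/mem, and that request is itself privilege-checked (root, and often further restricted by CONFIG_STRICT_DEVMEM). A hedged sketch; the physical address below (the legacy VGA window) is purely illustrative:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    /* Illustrative physical address only -- a real driver would get the
     * device's region from its BARs; this value is just for the sketch. */
    off_t phys_addr = 0x000A0000;       /* legacy VGA window */
    size_t length = 4096;

    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open /dev/mem");        /* unprivileged processes stop here */
        return 1;
    }

    /* The kernel decides whether this mapping is allowed at all. */
    volatile unsigned char *mmio =
        mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phys_addr);
    if (mmio == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    printf("first byte of the mapped region: 0x%02x\n", mmio[0]);

    munmap((void *)mmio, length);
    close(fd);
    return 0;
}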
The device is controlled by driver code, which either runs in a specially configured user-level context that has the particular ports and memory mapped (the microkernel model, where drivers are not part of the kernel, as in MINIX). This architecture is more robust (a crash in a driver can't take down the kernel; the kernel can isolate a problematic driver and restart it, or just kill it completely), but the context switches between kernel and user level are a very costly operation, so data throughput suffers a bit.
Or the device driver code runs at kernel level (the monolithic kernel model, like Linux), so any vulnerability in driver code can attack the kernel directly (still not trivial, but a lot easier than trying to tunnel out of the user context through some kernel bug). The overall performance of I/O is better, though (especially with devices like graphics cards or RAID disk arrays, where the data bandwidth goes into GiB per second). For example, this is the reason why early USB drivers were such a huge security risk: they tended to be quite buggy, so a specially crafted USB device could execute rogue code in a kernel-level context.
So, as Hyd already answered, under ordinary circumstances, when everything works as it should, a user-level application should not be able to emit a single bit outside of its user sandbox, and suspicious behaviour outside of system calls will either be ignored or will crash the app.
If you find a way to break this rule, it's a security vulnerability, and those usually get patched ASAP once the OS vendor is notified about them.
Some of the current problems are difficult to patch, though. For example, "row hammering" of current DRAM chips can't be fixed at the SW (OS) or CPU (configuration/firmware flash) level at all! Most current PC hardware is vulnerable to this kind of attack.
Or, in the mobile world, devices use radio chips based on legacy designs, with closed-source firmware developed years ago, so if you have enough resources to pay for research on these, it's very likely you could seize any particular device with a fake BTS station sending a malicious radio signal to the target device.
Etc. It's a constant war: vendors and security researchers trying to patch all vulnerabilities, and attackers trying to find, ideally, a zero-day exploit, or at least picking off users who don't patch their devices/software fast enough against known bugs.
Not normally. If it is possible, it is because of an operating system software error. If the software error is discovered, it is fixed fast, as it is considered a security vulnerability, which equals bad news.
"System" calls execute at a higher processor privilege level than the application: generally kernel mode (but some systems have multiple system-level modes).
What you see as a "system" call is actually just a wrapper that sets up registers and then triggers a change-mode exception of some kind (the method is system specific). The system exception handler then dispatches to the appropriate system server.
You cannot just write your own function and do bad things. True, sometimes people find bugs that allow circumventing the system protections, but as a general principle, you cannot access devices unless you do it through the system services.
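To make "sets up registers and then triggers a change-mode exception" concrete, this is roughly what a write() wrapper boils down to on x86-64 Linux (a sketch of the ABI, not actual libc source):

#include <stdio.h>

int main(void)
{
    const char msg[] = "write(2) via a hand-rolled trap\n";
    long ret;
    long nr = 1;                      /* __NR_write on x86-64 Linux */

    /* x86-64 Linux ABI: system call number in rax, arguments in rdi, rsi,
     * rdx; the syscall instruction performs the change-mode transition into
     * the kernel. The CPU clobbers rcx and r11 in the process. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "0"(nr),                     /* rax: syscall number */
                        "D"((long)1),                /* rdi: fd = stdout    */
                        "S"(msg),                    /* rsi: buffer         */
                        "d"((long)(sizeof msg - 1))  /* rdx: length         */
                      : "rcx", "r11", "memory");

    if (ret < 0)
        fprintf(stderr, "write failed with -errno = %ld\n", ret);
    return 0;
}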

The Windows desktop becomes paralysed during heavy network I/O / Windows kernel allocates only 1 out of many CPUs?

Problem: We implement a video recording system on a Windows Server 2012 system. In spite of low CPU and memory consumption, we face serious performance problems.
Short program description: the application (VS2005/C++) creates many network sockets, each receiving a multicast UDP video stream from an Ethernet network. Per stream, the application provides a receive buffer by calling WSARecvFrom() (overlapped operation), waits in MsgWaitForMultipleObjects() for Windows' "data arrived" event, takes the data packet, and repeats it all again in an endless loop. For testing, to assure minimal CPU and memory consumption besides the pure socket I/O work, the application does nothing else, not even any disk/file I/O. The application process is configured to use all available cores on the machine (default affinity settings unchanged).
Tests run: the test is run on two different machines: a) a Windows 7 with 4 physical cores / 8 with hyper-threading, and b) a Windows Server 2012 with 12 physical cores / 24 with hyper-threading.
Both systems show the same problem: everything works fine up to a certain number of configured sockets / network streams. Increasing them further (and we need to) finally paralyses the Windows desktop (mouse pointer, repainting). At this stage the total CPU load is still very low (e.g. 10-15%) and there is plenty of free memory available. But the Task Manager shows extremely one-sided CPU loads: CPU 0 nearly 100%, all other CPUs near 0%. Changing the Processor Affinity for the process in the Task Manager doesn't help.
Question 1: it looks like CPU 0 is doing the whole kernel's network I/O work. Is that likely?
Question 2: if yes, is there a way to control the kernel's use of the available CPUs? If yes, how?
Question 3: if no, is there any other way to make Windows distribute the (kernel) network I/O work to other CPUs (e.g. by installing multiple NICs, each receiving only a subset of the network streams, and binding each NIC to another CPU)?
Most thankful for any hints from anybody out there.
I'm not a Windows server guy, but this sounds like an interrupt issue. This often happens in high throughput systems, especially real-time ones.
Background:
Simply speaking, for each packet your network interface generates an interrupt, informing the CPU that it needs to handle the newly arrived data. High throughput network cards (e.g. 10Gbps) that receive small packets can easily overwhelm the CPU with these interrupts.
Just to get a feel for the problem, let's do some math: if you saturate a 10G line with 100-byte packets, that means that (ideally) 12,500,000 packets are sent over the line each second. In reality, it's less due to overhead; say 10,000,000 packets per second (pps). Your 3 GHz CPU generates 3,000,000,000 clocks per second, so it needs to handle a packet every 300 clock cycles. That's pretty hard for a general-purpose machine.
Now, I don't know the rate of packet arrival in your case, nor do I know your average packet length. But based on the symptoms you described, you might have run into this issue.
Solutions
Offload work to your card
Modern-day network cards, especially high-throughput ones, support all kinds of useful offloads such as GRO, TOE, and others. These take some network-related work off the CPU (such as checksum calculation, packet fragmentation, etc.) and put it onto the network card, which carries dedicated hardware for performing it. Check out the offloads supported by your card. In Linux, offloading is managed using an application called ethtool. Since I never played with offloading in Windows, I can only point in the direction of the most relevant Windows article I found, but I can't offer any experience-based advice.
Use interrupt throttling.
Interrupt throttling is another ability of (some) network cards and their drivers which allows them to limit the number of interrupts your CPU receives, essentially interrupting the core once every few packets instead of once per packet.
Use a multi-queue network card, and set interrupt affinities.
Some network cards have multiple (packet) queues, and therefore multiple interrupt lines, one per queue. They split incoming traffic evenly between queues using a hash function, creating (usually) 8 or 16 flows at 1/8 or 1/16 of the line rate. Each flow can be tied to a specific CPU core using interrupt affinity, and since the hash function is calculated on IPs and port numbers and is deterministic, each TCP/IP-level session will always be handled by the same core. In Linux, setting the affinity requires writing to /proc/irq/<interrupt number>/smp_affinity. In Windows, this seems to be the way.
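For the Linux side of that, pinning one of those per-queue interrupts to a core is just a write of a hex CPU mask to the /proc/irq/<interrupt number>/smp_affinity file mentioned above. A small sketch; the IRQ number 42 and the mask are placeholders for illustration, not values from this question:

#include <stdio.h>

/* Pin an IRQ to a set of CPUs by writing a hex bitmask to
 * /proc/irq/<irq>/smp_affinity. Needs root. */
static int set_irq_affinity(int irq, const char *cpu_mask_hex)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/irq/%d/smp_affinity", irq);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }

    int ok = (fputs(cpu_mask_hex, f) >= 0);
    fclose(f);
    return ok ? 0 : -1;
}

int main(void)
{
    /* e.g. steer hypothetical queue IRQ 42 to CPU 2 (mask 0x4). */
    return set_irq_affinity(42, "4") == 0 ? 0 : 1;
}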

Is there a way to take advantage of multi-core when dealing with network connections?

When we do network programming, no matter whether you use multiple processes, multiple threads, or select/poll (epoll), there is only one process/thread dealing with accepting connections on a given port. And if you want to take advantage of multiple cores, you need to create worker processes/threads. But what if the bottleneck is dealing with the network connections themselves? Is there a way to take advantage of multiple cores when dealing with network connections?
I found some materials, and it seems this is hard to accomplish.
The three-way handshake is done implicitly by the kernel. And on SMP systems, the operating system is divided into several critical sections; the same critical section can't run on more than one core at the same time.
All modern operating systems that run on PC hardware already have their network stacks heavily optimized for multi-core CPUs. For example, the packet handling code that pushes data to and from the network card is going to be independent of the TCP/IP stack code so a hardware interrupt can run to completion without disturbing the TCP code.
For most real-world applications though, the bulk of the work is between the packets. Data that comes in has to be processed and data that goes out has to be generated. That's up to application code, and that code can take advantage of multiple cores either by using multiple threads or multiple processes. How you do that best is very application and operating system specific. Windows, for example, has I/O completion ports which combine job discovery with multi-threaded job dispatch. Linux has epoll.
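As a rough Linux-flavoured sketch of that idea (my own illustration, not from the answer): give each worker thread its own listening socket bound with SO_REUSEPORT and its own epoll instance, and the kernel will spread incoming connections across the workers. The port number and worker count below are arbitrary; build with cc -pthread.

#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define PORT 8080
#define NWORKERS 4          /* roughly one per core; illustrative */

/* Each worker owns its own listening socket (SO_REUSEPORT) and its own
 * epoll instance, so the kernel load-balances new connections across them. */
static void *worker(void *arg)
{
    (void)arg;

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof one);

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);
    if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(lfd, 128) < 0) {
        perror("bind/listen");
        return NULL;
    }

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

    for (;;) {
        struct epoll_event events[64];
        int n = epoll_wait(ep, events, 64, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == lfd) {
                int cfd = accept(lfd, NULL, NULL);     /* new connection */
                if (cfd >= 0)
                    close(cfd);    /* real code would add cfd to this epoll */
            }
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}

Because each new connection is hashed to exactly one of the listening sockets, the accept load is spread without any dispatcher thread.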
With just the network traffic, that's almost solely handled by the network card (i.e. not the computer's CPU). Communication with the network card is usually single-threaded (queued by the OS so you can send/receive on multiple threads) because a NIC can only push/pop stuff off its stack one at a time.
It's up to your process to do what it needs in response to received data. That can be done on one thread, and you can spawn other threads upon receipt of data on that master thread and divide the work up that way. If you have a language that supports asynchronous communication, I would try to get it to do most of the work of using multiple threads.
