I wondered why we need to switch to kernel space when we want to access a hardware device. I understand that sometimes, for specific actions such as memory allocation, we need to make system calls in order to switch from user space to kernel space, because the operating system needs to organize everything, keep processes separated, and control how they use memory and other resources. But why can't we access a hardware device directly?
There is no problem in writing your own driver to access the hardware from user space, and plenty of documentation is available. For example, this tutorial at xatlantis seems to be a recent and good source.
The reason it has been designed like that is mainly security. Most systems I know about specifically do not allow user programs to do I/O or to access kernel-space memory. Such things would lead to wildly insecure systems, because with access to the kernel a user program could change permissions and get access to any data anywhere in the system, and presumably change it.
References:
xatlantis
Stack Exchange
A device driver may choose to provide user processes with access to device registers, device memory, or both. A common method is a device-specific service connected with an mmap() request. Consider a frame buffer's on-board memory, and the efficiency gained from a user process being able to read/write that space directly. For devices in general there are security considerations, and drivers that provide direct access often restrict it to processes with sufficient credentials. Files within /dev are usually set with owner/group access permissions that are similarly limited.
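As an illustration of that mmap() path, here is a minimal sketch (Linux, assuming a framebuffer driver exposes /dev/fb0 and the process has permission to open it) that maps the device memory into the process and writes to it without a system call per pixel:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/fb.h>

int main(void)
{
    int fd = open("/dev/fb0", O_RDWR);           /* device node created by the driver */
    if (fd < 0) { perror("open /dev/fb0"); return 1; }

    struct fb_var_screeninfo vinfo;
    if (ioctl(fd, FBIOGET_VSCREENINFO, &vinfo) < 0) { perror("ioctl"); return 1; }

    size_t len = (size_t)vinfo.yres_virtual * vinfo.xres_virtual * vinfo.bits_per_pixel / 8;

    /* The driver's mmap handler maps the card's memory into our address space. */
    unsigned char *fb = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (fb == MAP_FAILED) { perror("mmap"); return 1; }

    memset(fb, 0xFF, len);                       /* paint the whole screen white directly */

    munmap(fb, len);
    close(fd);
    return 0;
}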
I understand that an Operating System forces security policies on users when they use the system and filesystem via the System Calls supplied by stated OS.
Is it possible to circumvent this security by issuing your own hardware instructions instead of making use of the System Call interface supplied by the OS? Even writing a single bit to a file you normally have no access to would be enough.
First, for simplicity, I'm considering the OS and the kernel to be the same thing.
A CPU can be in different modes when executing code.
Let's say a hypothetical CPU has just two modes of execution (Supervisor and User).
When in Supervisor mode, you are allowed to execute any instructions, and you have full access to the hardware resources.
When in User mode, there is a subset of instructions you don't have access to, such as instructions to deal with hardware or to change the CPU mode. Trying to execute one of those instructions will cause the OS to be notified that your application is misbehaving, and it will be terminated. This notification is done through interrupts. Also, when in User mode, you only have access to a portion of the memory, so your application can't even touch memory it is not supposed to.
Now, the trick for this to work is that while in Supervisor Mode, you can switch to User Mode, since it's a less privileged mode, but while in User Mode, you can't go back to Supervisor Mode, since the instructions for that are not permitted anymore.
The only way to go back to Supervisor mode is through system calls or interrupts. That enables the OS to have full control of the hardware.
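On a real system such as Linux, that controlled entry point is exactly what the syscall() wrapper exposes. A minimal sketch (Linux-specific; the hypothetical CPU above doesn't prescribe any particular interface):

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>    /* SYS_getpid, SYS_write */

int main(void)
{
    /* Each of these calls traps into the kernel (Supervisor Mode);
     * the kernel performs the service and then returns to User Mode. */
    long pid = syscall(SYS_getpid);
    syscall(SYS_write, 1, "hello from user mode\n", 21);
    printf("pid = %ld\n", pid);
    return 0;
}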
A possible example of how everything fits together for this hypothetical CPU:
The CPU boots in Supervisor mode
Since the CPU starts in Supervisor Mode, the first thing to run has access to the full system. This is the OS.
The OS sets up the hardware any way it wants: memory protections, etc.
The OS launches any application you want after configuring permissions for that application. Launching the application switches to User Mode.
The application is running, and only has access to the resources the OS allowed when launching it. Any access to hardware resources needs to go through System Calls.
I've only explained the flow for a single application.
As a bonus to help you understand how this fits together with several applications running, a simplified view of how preemptive multitasking works:
In a real-world situation, the OS will set up a hardware timer before launching any applications.
When this timer expires, it causes the CPU to interrupt whatever it was doing (e.g., running an application), switch to Supervisor Mode, and execute code at a predetermined location, which belongs to the OS and which applications don't have access to.
Since we're back in Supervisor Mode and running OS code, the OS now picks the next application to run, sets up any required permissions, switches to User Mode, and resumes that application.
These timer interrupts are how you get the illusion of multitasking. The OS keeps switching between applications quickly.
The bottom line here is that unless there are bugs in the OS (or the hardware design), the only way an application can go from User Mode to Supervisor Mode is through the OS itself with a System Call.
This is the mechanism I use in my hobby project (a virtual computer) https://github.com/ruifig/G4DevKit.
HW devices are connected to the CPU through a bus. The CPU communicates with them either by using in/out instructions to read/write values at I/O ports (not used much with current HW; in the early age of home computers this was the common way), or a part of the device memory is "mapped" into the CPU address space, and the CPU controls the device by writing values at defined locations in that shared memory.
None of this is accessible from the "user level" context where common applications are executed by the OS (so an application trying to write to that shared device memory would crash on an illegal memory access; actually that piece of memory is usually not even mapped into user space, i.e. it doesn't exist from the user application's point of view). Direct in/out instructions are blocked at the CPU level too.
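As an illustration, on x86 Linux a process can only use in/out instructions after explicitly asking the kernel for permission. A minimal sketch (requires root; port 0x80 is just the traditional POST diagnostic port, chosen here as a harmless example):

#include <stdio.h>
#include <sys/io.h>     /* ioperm(), outb() - x86 only */

int main(void)
{
    /* Without this system call, the outb below would be killed with SIGSEGV:
     * the CPU refuses port I/O from an unprivileged user-mode process. */
    if (ioperm(0x80, 1, 1) < 0) {
        perror("ioperm (need root)");
        return 1;
    }
    outb(0x42, 0x80);   /* the CPU now allows access to this single port */
    printf("wrote to port 0x80\n");
    return 0;
}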
The device is controlled by the driver code, which is either run in a specially configured user-level context that has the particular ports and memory mapped (the micro-kernel model, where drivers are not part of the kernel, as in the OS MINIX). This architecture is more robust (a crash in a driver can't take down the kernel; the kernel can isolate a problematic driver and restart it, or just kill it completely), but the context switches between kernel and user level are a very costly operation, so data throughput suffers a bit.
Or the device driver code runs at kernel level (the monolithic kernel model, like Linux), so any vulnerability in driver code can attack the kernel directly (still not trivial, but a lot easier than trying to tunnel out of the user context through some kernel bug). But the overall performance of I/O is better (especially with devices like graphics cards or RAID disc clusters, where the data bandwidth goes into GiBs per second). For example, this is the reason why early USB drivers were such a huge security risk: they tended to be quite buggy, so a specially crafted USB device could execute rogue code in a kernel-level context.
So, as Hyd already answered, under ordinary circumstances, when everything works as it should, a user-level application should not be able to emit a single bit outside of its user sandbox, and suspicious behaviour outside of system calls will either be ignored or crash the app.
If you find a way to break this rule, it's a security vulnerability, and those usually get patched ASAP once the OS vendor is notified about them.
Although some of the current problems are difficult to patch. For example, "row hammering" of current DRAM chips can't be fixed at the SW (OS) or CPU (configuration/firmware flash) level at all! Most current PC HW is vulnerable to this kind of attack.
Or, in the mobile world, devices use radio chips based on legacy designs, with closed-source firmware developed years ago, so if you have enough resources to pay for research on these, it's very likely you would be able to seize any particular device by having a fake BTS station send a malicious radio signal to the target device.
Etc... it's a constant war: vendors and security researchers patch vulnerabilities, while hackers try to find, ideally, a zero-day exploit, or at least pick off users who don't patch their devices/SW fast enough against known bugs.
Not normally. If it is possible, it is because of an operating-system software error. If the software error is discovered, it is fixed fast, as it is considered a software vulnerability, which equals bad news.
"System" calls execute at a higher processor level than the application: generally kernel mode (but system systems have multiple system level modes).
What you see as a "system" call is actually just a wrapper that sets up registers and then triggers a Change Mode Exception of some kind (the method is system specific). The system exception handler dispatches to the appropriate system service.
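For a concrete illustration of what such a wrapper boils down to, this is roughly what the C library does for write(1, "hi\n", 3) on x86-64 Linux: load the registers the kernel expects, then execute the instruction that raises the mode-change exception:

#include <stdio.h>

int main(void)
{
    /* x86-64 Linux convention: rax = syscall number, rdi/rsi/rdx = arguments.
     * The 'syscall' instruction switches the CPU to kernel mode and jumps to
     * the kernel's entry point; the kernel dispatches on rax. */
    register long rax asm("rax") = 1;              /* __NR_write */
    register long rdi asm("rdi") = 1;              /* fd = stdout */
    register long rsi asm("rsi") = (long)"hi\n";   /* buffer */
    register long rdx asm("rdx") = 3;              /* count */
    asm volatile ("syscall"
                  : "+r"(rax)
                  : "r"(rdi), "r"(rsi), "r"(rdx)
                  : "rcx", "r11", "memory");
    printf("kernel returned %ld\n", rax);          /* bytes written, or -errno */
    return 0;
}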
You cannot just write your own function and do bad things. True, sometimes people find bugs that allow circumventing the system protections. As a general principle, you cannot access devices unless you do it through the system services.
My main question is: is there a piece of code running in the X-Server process memory (excluding drivers, which we all know can be written in different manners) that directly accesses memory on the GPU card?
Or does it employ drivers and DRM, or any other interface, for communicating with the GPU and queuing draw/render/clear/... commands?
I know the question seems lame, but I am interested in the specifics.
EDIT:
More specifically: to my understanding, the kernel communicates with hardware with assistance from drivers, and exposes an API to the rest (if I am wrong, please correct me).
In this context, can the X-Server circumvent the DMA API (I am only guessing that DMA I/O is responsible for communication with peripherals) located in the kernel, to communicate and exchange data with the GPU card directly, without anyone's assistance == without the kernel, drivers, ...?
And what would be the bare minimum requirement for the X-Server to communicate with the GPU? I am aiming to understand how this communication is done at a low level.
It is entirely possible that on Linux a given X server accesses part of the video card memory directly as a framebuffer. It's not the most efficient way of displaying things, but it works.
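The usual modern path, though, is the one the question guesses at: the X server (or any other user-space display server) talks to the GPU through the kernel's DRM interface rather than touching the hardware behind the kernel's back. A minimal sketch using libdrm (assumes /dev/dri/card0 exists and libdrm is installed; link with -ldrm):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

int main(void)
{
    /* The device node is created by the kernel GPU driver; everything we do
     * with the card goes through ioctl()s on this file descriptor. */
    int fd = open("/dev/dri/card0", O_RDWR);
    if (fd < 0) { perror("open /dev/dri/card0"); return 1; }

    drmVersionPtr ver = drmGetVersion(fd);          /* which kernel driver is this? */
    if (ver) {
        printf("driver: %s (%s)\n", ver->name, ver->desc);
        drmFreeVersion(ver);
    }

    drmModeResPtr res = drmModeGetResources(fd);    /* CRTCs, connectors, ... */
    if (res) {
        printf("%d connectors, %d crtcs\n", res->count_connectors, res->count_crtcs);
        drmModeFreeResources(res);
    }

    close(fd);
    return 0;
}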
We are writing highly concurrent software in C++ for a few hosts, all equipped with a single ST9500620NS as the system drive and an Intel P3700 NVMe Gen3 PCIe SSD card for data. Trying to understand the system better for tuning our software, I dug around the system (two E5-2620 v2 @ 2.10GHz CPUs, 32GB RAM, running CentOS 7.0) and was surprised to spot the following:
[root@sc2u0n0 ~]# cat /sys/block/nvme0n1/queue/scheduler
none
This contradicts everything that I learned about selecting the correct Linux I/O scheduler, such as from the official doc on kernel.org.
I understand that NVMe is a new kid on the block, so for now I won't touch the existing scheduler setting. But I really feel odd about the "none" put in by the installer. If anyone has some hints as to where I can find more info, or wants to share their findings, I would be grateful. I have spent many hours googling without finding anything concrete so far.
The answer given by Sanne in the comments is correct:
"The reason is that NVMe bypasses the scheduler. You're not using the "noop" implementation: you're not using a scheduler."
noop is not the same as none; noop still performs block merging (unless you disable it with nomerges).
If you use an nvme device, or if you enable "scsi_mod.use_blk_mq=Y" at compile time or boot time, then you bypass the traditional request queue and its associated schedulers.
Schedulers for blk-mq might be developed in the future.
"none" (aka "noop") is the correct scheduler to use for this device.
I/O schedulers are primarily useful for slower storage devices with limited queueing (e.g., single mechanical hard drives); the purpose of an I/O scheduler is to reorder I/O requests so that more important ones get serviced earlier. For a device with a very large internal queue and very fast service (like a PCIe SSD!), an I/O scheduler won't do you any good; you're better off just submitting all requests to the device immediately.
When writing embedded ARM code, it's easy to access the built-in zero-wait-state memory to accelerate your application. Windows CE doesn't expose this to user-mode applications, but there is probably a way to do it. The internal SRAM is usually used for the video buffer, but there's usually some left over. Anyone know how to do it?
Thanks,
Larry B.
Unfortunately, you can't access the high-speed RAM from user-mode processes.
The only way to get access to it on a Windows CE OS is to write a driver, map the fixed address of the TCM into the user-mode process' address space, and pass it to the user-mode process.
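A minimal sketch of how such a driver (or, on older CE versions, a sufficiently trusted process) could do that mapping with VirtualAlloc/VirtualCopy; the SRAM physical address and size below are made up and must be replaced with the values for your board's BSP:

#include <windows.h>

/* Hypothetical values: the real physical address/size of the spare internal
 * SRAM (TCM) are board-specific and come from the BSP documentation. */
#define SRAM_PHYS 0x40200000
#define SRAM_SIZE 0x00008000

volatile unsigned char *MapInternalSram(void)
{
    /* Reserve a virtual range, then ask the kernel to back it with the
     * physical SRAM pages. With PAGE_PHYSICAL, VirtualCopy expects the
     * physical address divided by 256. */
    void *va = VirtualAlloc(NULL, SRAM_SIZE, MEM_RESERVE, PAGE_NOACCESS);
    if (va == NULL)
        return NULL;

    if (!VirtualCopy(va, (void *)(SRAM_PHYS >> 8), SRAM_SIZE,
                     PAGE_READWRITE | PAGE_NOCACHE | PAGE_PHYSICAL)) {
        VirtualFree(va, 0, MEM_RELEASE);
        return NULL;
    }
    return (volatile unsigned char *)va;
}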