Can containers be implemented purely in userspace? - linux

There are a bunch of container mechanisms for Linux now: LXC, Docker, lmctfy, OpenVZ, Linux-VServer, etc. All of these either involve kernel patches or recently added Linux features like cgroups and seccomp.
I'm wondering if it would be possible to implement similar (OS-level) virtualization purely in userspace.
There's already a precedent for this - User Mode Linux. However, it also requires special kernel features to be reasonably fast and secure. Also, it is literally a Linux kernel running in userspace, which makes networking setup rather difficult.
I'm thinking more along the lines of a process that would act as an intermediary between spawned programs and the Linux kernel. You would start the process with the programs to spawn as arguments; it would track system calls they made, and block or redirect attempts to access the real root filesystem, real network devices, etc. without itself relying on special kernel features.
Is such a thing possible to implement securely, and in a way that could be invoked effectively by a limited user (i.e. not privileged like chroot)?
In summary: would a pure userspace implementation of something like LXC be possible? If yes, what would the penalties be for doing it in userspace? If no, why not?

Surprisingly it turns out the answer is "yes": this is what systrace and sysjail do.
http://sysjail.bsd.lv/
And they are also inherently insecure on modern operating systems.
http://www.watson.org/~robert/2007woot/
So if you want proper sandboxing, it has to be done in kernel space.

Related

Linux's system calls for GUI? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 1 year ago.
Improve this question
I'm studying Operating Systems. I read Window have lots of system calls for manage windows and GUI components. I have read you can change the GUI manager of your Linux Operating System. Then does Linux have system calls for GUI managements? How GUI works in Linux?
I'll take x86 as an example as I am more aware of x86 stuff than ARM stuff. Also, I may get some information wrong as I've been doing some research on this question while answering. Feel free to correct me if I am wrong.
System booting
Some time ago, Linux used to boot with a legacy bootloader (GRUB legacy version). The GRUB bootloader would be started by the BIOS at 0x7c00 in RAM and then would read the kernel from the hard-disk. It would follow the multiboot specification. The multiboot specification mentions the state that the computer needs to be in before jumping to the kernel's entry point. The kernel would then launch a first process (init) that every process would be a child of.
Today, most Linux distributions boot with UEFI (with the option of legacy booting also available). A UEFI app is placed on the boot partition partionned as a GPT ESP (EFI System Partition). This EFI app is launched and then follows the Linux Boot Protocol to launch Linux. The init process was also replaced by systemd. Linux will thus launch systemd as the first process of the computer. Actually, as stated on the manpage for systemd:
systemd is usually not invoked directly by the user, but is
installed as the /sbin/init symlink and started during early
boot.
The process that will be started is thus /sbin/init but it is a symlink to systemd. The systemd process will then read several configuration files on the hard-disk called units. These units are often targets which specify several units to read. Targets are thus units which specify several units to read. At first systemd will read default.target which specifies several other units. Some of these other units will start some processes among which is the Display Manager (fancy terminology which means login prompt). Recently, Ubuntu starts the Gnome Display Manager (GDM) as the first displaying program (gdm.service unit). This program will start the X server before presenting the user login screen (https://en.wikipedia.org/wiki/X_display_manager).
When the display manager runs on the user's computer, it starts the X server before presenting the user the login screen, optionally repeating when the user logs out.
Once logged in, GDM will start several other binaries responsible to let you interact with the system (the actual desktop, a binary to gather input for this desktop, etc). All of these components depend on the X server to work properly.
The DRM
The X server is a user program which makes extensive use of the Direct Rendering Manager (DRM) of the Linux kernel. The DRM is a system call interface which is used to interact with graphics cards. When the DRM detects a graphics card, it exposes a file like /dev/dri/card0 which is a character device (http://manpages.ubuntu.com/manpages/bionic/man7/drm.7.html).
In earlier days, the kernel framework was solely used to provide raw hardware access to
priviledged user-space processes which implement all the hardware abstraction layers. But
more and more tasks were moved into the kernel. All these interfaces are based on ioctl(2)
commands on the DRM character device. The libdrm library provides wrappers for these
system-calls and many helpers to simplify the API.
When a GPU is detected, the DRM system loads a driver for the detected hardware type. Each
connected GPU is then presented to user-space via a character-device that is usually
available as /dev/dri/card0 and can be accessed with open(2) and close(2). However, it
still depends on the grapics driver which interfaces are available on these devices. If an
interface is not available, the syscalls will fail with EINVAL.
The ioctl call allows to have any number of operations on the /dev/dri/card0 file since it is a general call which includes a request argument which is simply an unsigned long. It also takes a variable amount of arguments (see https://man7.org/linux/man-pages/man2/ioctl.2.html).
The ioctl call thus allows hardware vendors (like NVIDIA, AMD, etc) to provide drivers for their cards with the general ioctl call used as a general interface between user mode and kernel mode.
OpenGL
There exists several 3D rendering APIs available (OpenGL, Direct3D). OpenGL is mostly a set of C headers and a convention. The convention says what a certain call should do. It is up to the hardware vendor to implement the convention for their own card. Mesa3D has been an attempt to create an open source implementation of OpenGL for certain graphics cards. It worked quite well for integrated Intel HD Graphics (since documentation is open) and AMD (since they cooperated and offered some insight into the workings of their cards), but not for NVIDIA (the Nouveau driver is mostly not working or slow).
When you program some OpenGL, you include the OpenGL headers and link with libraries provided by hardware vendors which provide the definitions of the functions in the headers. These definitions make use of the DRM and cooperate with the X server to show content on the screen.
I'm studying Operating Systems. I read Window have lots of system calls for manage windows and GUI components. I have read you can change the GUI manager of your Linux Operating System. Then does Linux have system calls for GUI managements? How GUI works in Linux?
System calls (provided by the kernel) are often buried (e.g. in some cases deliberately undocumented and proprietary) and should not be used. Almost everything you see are actually normal functions in dynamically linked libraries/shared libraries. This allows the kernel's system calls to be radically changed without breaking everything (because everything only depends on the dynamically linked libraries/shared libraries); and reduces the functionality needed in the kernel itself.
For an example; most of the "system calls for managing windows and GUI components" you think Windows has could (internally, inside the relevant DLL) just end up using a single "send_message()" system call (to tell a different process, the GUI, that you want to create a window or change its position or ...).
For Linux it's roughly similar. The kernel's system calls (which actually are documented, for no sane reason - it goes against the spirit of SYS-V specs and means badly written "linux executables" aren't compatible with other Unix clones like FreeBSD or Solaris or OSX) exist to use things like low level memory management and raw file IO and sockets; but (like Windows) the kernel's system calls are buried under layers of shared libraries, and those shared libraries (e.g. like Xlib, GLib, KWindowSystem, Qt, ...) just use "something" (file IO, pipes, sockets, ...) provided by kernel to talk to another process (display server, GUI, ..).
Linux and Windows fall under separate categories; Linux is just a kernel, i.e. the piece under the hood that gives us the basic functionality we expect to run programs, like threads, memory and process management, etc. Windows is a full operating system, including the user facing components and numerous system libraries. An apter comparison would be a specific Linux distro and Windows.
On that note, distros, as independent operating systems, obviously can have different implementations of any OS component. Some distros, like Arch, don't come with a GUI by default at all. That said, essentially the entire Linux ecosystem uses Xorg and/or Wayland; I would recommend looking into the implementation details of those two.
A Linux GUI has quite a few differences compared to Windows GUI. For example, the GUI is not considered to be a part of the operating system, but rather an external part of it; that means no syscalls (not embedded whatsoever in the OS). After all, like the previous answer says, Linux is a kernel, that means it's only something really basic (allows execution of programs, memory/threads management, processes management, but not really much more). Whatever comes next (GUI, for example) are added features using packages.
This allows, for example, installing a GUI on top of a minimal installation of any Linux distro (CentOS, for example), and that GUI can be the one you want (Gnome, KDE...).

Is there any relationship between Secure Boot and Kernel Lockdown?

As far as I googled till now, the two features seem independent.
Secure Boot is dependent on Kernel Signature, so the bootloader will be checking (Kernel/Single Image Application) Signature if Valid will Call Kernel Start Function.
Lockdown is another feature where "The lockdown code is intended to allow for kernels to be locked down early in boot - sufficiently early that we don't have the ability to kmalloc() yet. Disallowing even privileged User from accessing Confidential Data present in kernel memory."
Lockdown comes in via boot parameters/sysfs control after kernel is authenticated to be
valid.
Is my understanding correct?
So with a Secure Boot Disabled, the Lockdown Feature should still work.
Yes, I'd say that your understanding is correct.
Secure boot is a security feature implemented in hardware (i.e. directly in your CPU, though it could also be implemented in UEFI firmware). It's a mechanism of verification that is done as the very first thing when powering on the computer. Some known public keys are stored in hardware and are used to verify the signature of the bootloader before running it. The process can be then repeated through the multiple stages of the boot process where each stage verifies the following one until the OS is started. To know more, take a look at this page of the Debian documentation about secure boot.
Kernel lockdown is a security feature of the Linux kernel, which was recently introduced in version 5.4 as an optional security module. As mentioned in this interesting article from LWN, the purpose of kernel lockdown is to enforce a distinction between running as root and the ability to run code in kernel mode. Depending on the configuration, kernel lockdown can disable features of the kernel that allow modifications of the running kernel or the extraction of confidential information from userspace. Take a look at the relevant commit message that introduced the feature to Linux.
The relationship between secure boot and kernel lockdown can be explained by this very important consideration (from the same LWN article linked above):
Proponents of UEFI secure boot maintain that this separation [i.e. kernel lockdown] is necessary; otherwise the promise of secure boot (that the system will only run trusted code in kernel mode) cannot be kept. Closing off the paths by which a privileged attacker could run arbitrary code in kernel mode requires disabling a number of features in the kernel.
In other words, one could argue that secure boot is useless if the kernel that is verified and ran can be then modified by userland processes. Indeed, without proper kernel hardening, a potential attacker could exploit privileged userland processes to alter the running kernel and therefore tear down the whole security model of the system.
On the other hand, one could as easily argue that kernel lockdown is useless without secure boot since a potential attacker could compromise the boot chain (e.g. modifying the bootloader or the kernel image on the disk) and make the machine run a modified kernel on the next boot.
Nonetheless, the two features are independent of each other. There are still valid use cases of secure boot without kernel lockdown, and vice versa kernel lockdown without secure boot. It all ultimately depends on what's your threat model.

Is it possible to circumvent OS security by not using the supplied System Calls?

I understand that an Operating System forces security policies on users when they use the system and filesystem via the System Calls supplied by stated OS.
Is it possible to circumvent this security by implementing your own hardware instructions instead of making use of the supplied System Call Interface of the OS? Even writing a single bit to a file where you normally have no access to would be enough.
First, for simplicity, I'm considering the OS and Kernel are the same thing.
A CPU can be in different modes when executing code.
Lets say a hypothetical CPU has just two modes of execution (Supervisor and User)
When in Supervisor mode, you are allowed to execute any instructions, and you have full access to the hardware resources.
When in User mode, there is subset of instructions you don't have access to, such has instructions to deal with hardware or change the CPU mode. Trying to execute one of those instructions will cause the OS to be notified your application is misbehaving, and it will be terminated. This notification is done through interrupts. Also, when in User mode, you will only have access to a portion of the memory, so your application can't even touch memory it is not supposed to.
Now, the trick for this to work is that while in Supervisor Mode, you can switch to User Mode, since it's a less privileged mode, but while in User Mode, you can't go back to Supervisor Mode, since the instructions for that are not permitted anymore.
The only way to go back to Supervisor mode is through system calls, or interrupts. That enables the OS to have full control of the hardware.
A possible example how everything fits together for this hypothetical CPU:
The CPU boots in Supervisor mode
Since the CPU starts in Supervisor Mode, the first thing to run has access to the full system. This is the OS.
The OS setups the hardware anyway it wants, memory protections, etc.
The OS launches any application you want after configuring permissions for that application. Launching the application switches to User Mode.
The application is running, and only has access to the resources the OS allowed when launching it. Any access to hardware resources need to go through System Calls.
I've only explained the flow for a single application.
As a bonus to help you understand how this fits together with several applications running, a simplified view of how preemptive multitasking works:
In a real-world situation. The OS will setup an hardware timer before launching any applications.
When this timer expires, it causes the CPU to interrupt whatever it was doing (e.g: Running an application), switch to Supervisor Mode and execute code at a predetermined location, which belongs to the OS and applications don't have access to.
Since we're back into Supervisor Mode and running OS code, the OS now picks the next application to run, setups any required permissions, switches to User Mode and resumes that application.
This timer interrupts are how you get the illusion of multitasking. The OS keeps changing between applications quickly.
The bottom line here is that unless there are bugs in the OS (or the hardware design), the only way an application can go from User Mode to Supervisor Mode is through the OS itself with a System Call.
This is the mechanism I use in my hobby project (a virtual computer) https://github.com/ruifig/G4DevKit.
HW devices are connected to CPU trough bus, and CPU does use to communicate with them in/out instructions to read/write values at I/O ports (not used with current HW too much, in early age of home computers this was the common way), or a part of device memory is "mapped" into CPU address space, and CPU controls the device by writing values at defined locations in that shared memory.
All of this should be not accessible at "user level" context, where common applications are executed by OS (so application trying to write to that shared device memory would crash on illegal memory access, actually that piece of memory is usually not even mapped into user space, ie. not existing from user application point of view). Direct in/out instructions are blocked too at CPU level.
The device is controlled by the driver code, which is either run is specially configured user-level context, which has the particular ports and memory mapped (micro-kernel model, where drivers are not part of kernel, like OS MINIX). This architecture is more robust (crash in driver can't take down kernel, kernel can isolate problematic driver and restart it, or just kill it completely), but the context switches between kernel and user level are a very costly operation, so the throughput of data is hurt a bit.
Or the device drivers code runs on kernel-level (monolithic kernel model like Linux), so any vulnerability in driver code can attack the kernel directly (still not trivial, but lot more easier than trying to get tunnel out of user context trough some kernel bug). But the overall performance of I/O is better (especially with devices like graphics cards or RAID disc clusters, where the data bandwidth goes into GiBs per second). For example this is the reason why early USB drivers are such huge security risk, as they tend to be bugged a lot, so a specially crafted USB device can execute some rogue code from device in kernel-level context.
So, as Hyd already answered, under ordinary circumstances, when everything works as it should, user-level application should be not able to emit single bit outside of it's user sandbox, and suspicious behaviour outside of system calls will be either ignored, or crash the app.
If you find a way to break this rule, it's security vulnerability and those get usually patched ASAP, when the OS vendor gets notified about it.
Although some of the current problems are difficult to patch. For example "row hammering" of current DRAM chips can't be fixed at SW (OS) or CPU (configuration/firmware flash) level at all! Most of the current PC HW is vulnerable to this kind of attack.
Or in mobile world the devices are using the radiochips which are based on legacy designs, with closed source firmware developed years ago, so if you have enough resources to pay for a research on these, it's very likely you would be able to seize any particular device by fake BTS station sending malicious radio signal to the target device.
Etc... it's constant war between vendors with security researchers to patch all vulnerabilities, and hackers to find ideally zero day exploit, or at least picking up users who don't patch their devices/SW fast enough with known bugs.
Not normally. If it is possible it is because of an operating system software error. If the software error is discovered it is fixed fast as it is considered to be a software vulnerability, which equals bad news.
"System" calls execute at a higher processor level than the application: generally kernel mode (but system systems have multiple system level modes).
What you see as a "system" call is actually just a wrapper that sets up registers then triggers a Change Mode Exception of some kind (the method is system specific). The system exception hander dispatches to the appropriate system server.
You cannot just write your own function and do bad things. True, sometimes people find bugs that allow circumventing the system protections. As a general principle, you cannot access devices unless you do it through the system services.

Can bash be used to communicate directly with hardware?

I am interested in writing my own tool in bash to act in place of my current network controller (wpa_supplicant) if possible. For example if I want to issue commands in order to begin a wps authentication session with a router's external registrar, is it possible, without using any pre-built tools, to communicate with the kernel to directly access the hardware? I have been told that I can achieve what I desire with a bash plugin called ctypes.sh but I am not too certain.
Generally speaking, the Linux kernel can interact with user-space through the following mechanisms:
Syscalls
Devices in /dev
Entries in /sys
Entries in /proc
Syscalls cannot be directly used from Bash but you need at least a binding through a C program.
You can create a Linux kernel driver or module which reads/writes data in an entry under /proc or /sys and then use a bash program to interact with it. Even if technically feasible, my personal opinion is that it is an overkill, and the usual C/C++ user-level programming with proper entries in /dev is much better.
No, this is generally not possible. Shell scripts can mess around in /proc, but they don't have the ability to perform arbitrary IOCTLs or even multi-step interactive IO. It's the wrong tool for the job.

Minimum configuration to run embedded Linux on an ARM processor?

I need to produce an embedded ARM design that has requirements to do many things that embedded Linux would do. However the design is cost sensitive and does not need huge amounts of horse power. Mostly will be talking to serial interfaces. Ideally I would like to use one of the low end ARMs. What is the lowest configuration of an ARM that you have successfully used embedded Linux on.
Edit:
The application needs a file system on some kind of flash device and the ability to run applications for processing the data. Some of the applications might be written by others than myself. I also need to ability to load new applications or update old apps using the serial ports to accept the apps.
When I have looked at other embedded OSes they seem to be more of a real time threading solution than having the ability to run applications. I am open to what ever will get the job done.
I think you need to weigh your cost options here.
ARM + linux is an option but you will be paying a very high operating overhead for such a simple (from your description) set of features. You can't just look at the cost of the ARM chip but must also consider external RAM which will very likely be required as well as flash to get enough space available to run the kernel + apps.
NOTE: you may be able to avoid the external requirements with a very minimal kernel and simple apps combined with a uC with large internal resources.
A second option is a much simpler microcontroller with a light weight OS. This will cut your hardware costs on the CPU and you can likely run something like this without external RAM or flash (dependent on application RAM and program space requirement)
third option: I don't actually see anything in your requirements that demands any OS at all be used. Basic file systems are very simple, for instance there are even FAT drivers out there for 8 bit PIC's. Interfacing to an SD card only requires a SPI port and minimal external circuitry.
The application bit could be simple or complex. I've built systems around PIC18 microcontollers that run a web server and allow program updates via a simple upload screen, it just stores the new program into an EEPROM or flash, reboots into a bootloader and copies the new program into internal program memory. You could likely design a way to do this without the reboot via a cooperative multitasking type of architecture. Any way you go the programmers writing the apps are going to need to have knowledge of the architecture and access to libraries / driver you write. Your best bet to simplify this is to provide as simple an API as possible and to try to automate the build process for them.
The third option will be the "cheapest" in terms of hardware as there will be very little overhead in the processing of your applications allowing you to get away with minimal processing power and memory. It likely will require some more programming/software architecting on your part but won't require nearly the research you will need to undertake to get linux up and running in addition to learning to write the needed device drivers under a linux paradigm.
As always you have to include the software development costs in the build cost of the device. If you plan to build 10,000+ of these your likely better off keeping hardware costs down and putting more man power into designing a software solution that allows that hardware to meet the design goals. If your building 10 of them, your better off spending an extra $15-20 on hardware if it can cut down on your software development costs. For example an ARM with MMU with full linux kernel support and available device drivers.
I kind of feel that your selecting the worst of both worlds at the moment, your paying extra to get a uC you can run linux on but by doing so your also selecting a part that will likely be the most complex to get linux up and running on, especially having not worked with linux on embedded platforms before.
I've had success even on ARM7TDMI, so I don't think you're going to have any trouble. If you have a low-requirements system, you could use any kind of lightweight real-time executive and have a lot better experience than you would getting Linux to work.
I've used a TS-7200 for about five years to run a web server and mail server, using Debian GNU Linux. It is 200 MHz and has 32 MB of RAM, and is quite adequate for these tasks. It has serial port built in. It's based on a ARM920T.
This would be overkill for your job; I mention it so you have another data point.
For several years I've been using a gumstix to do prototyping and testing and I've had good results with it. I don't know if the processor they are using (Intel PXA255 on my board) is considered low-cost, but the entire Verdex line seems pretty cheap to me for an adaptable device.
ucLinux is designed specifically for resource constrained targets, but perhaps more importantly for targets without an MMU.
However you have to have a good reason to use Linux on such a system rather than a small real-time executive. Out-of-the-box networking, readily available drivers and protocol stacks for complex hardware and support for existing POSIX legacy or open source code are a few perhaps. However if you don't need that, Linux is still large, and you may be squandering resources for no real benefit. In most cases you will still need off-chip SDRAM and Flash if you choose Linux of any flavour.
I would not regard serial I/O as 'complex hardware', so unless you are running a complex, but standard protocol, your brief description does not appear to warrant the use of Linux IMO
My DLINK DIR-320 router runs Linux inside.
And I know some handymen, flashing it with Optware and connecting USB-hub, HDDs, USB-flash, and much more.
It's low-cost ready for use "platform". (If you don't need mass production). But maybe more powerful than you need.
Additionally, it can be configured wirelessly via web-interface even through your pda :)

Resources