I'm developing a program which needs to access a special USB device. This USB device appears as a regular file in the filesystem, so I have to open this file with the O_DIRECT flag, as follows:
open(pathname, O_CREAT | O_RDWR | O_DIRECT | O_SYNC, S_IRWXU)
The program works well in a PC environment, but when I port it to an embedded board running OpenWrt, the open() call fails with EINVAL (22, /* Invalid argument */).
O_DIRECT support is enabled in the kernel configuration.
The OpenWrt root filesystem uses squashfs and jffs2.
The filesystem of the USB device is FAT, and it is mounted on the /media/aegis directory.
The board architecture (ARCH) is MIPS.
It seems that the error is returned by the following function in the kernel:
int open_check_o_direct(struct file *f)
{
    /* NB: we're sure to have correct a_ops only after f_op->open */
    if (f->f_flags & O_DIRECT) {
        if (!f->f_mapping->a_ops ||
            ((!f->f_mapping->a_ops->direct_IO) &&
             (!f->f_mapping->a_ops->get_xip_mem))) {
            return -EINVAL;
        }
    }
    return 0;
}
It is known that O_DIRECT isn't supported on jffs2 but is supported on FAT. When operating on a file in /media/aegis I expect the a_ops of the FAT filesystem to be used, but the program doesn't behave as I expect.
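(Not part of the original program, just a sketch.) If a fallback to buffered I/O is acceptable for this device — which is an assumption, since O_DIRECT may be required for correctness here — a common pattern is to retry the open() without O_DIRECT when the filesystem rejects it with EINVAL:
#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <errno.h>
#include <sys/stat.h>

/* Try O_DIRECT first; if the filesystem has no direct_IO support
 * (open() fails with EINVAL), retry with a buffered open. */
static int open_device(const char *pathname)
{
    int fd = open(pathname, O_CREAT | O_RDWR | O_DIRECT | O_SYNC, S_IRWXU);
    if (fd < 0 && errno == EINVAL)
        fd = open(pathname, O_CREAT | O_RDWR | O_SYNC, S_IRWXU);
    return fd;   /* -1 with errno set on failure */
}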
I need help with managing HugePages on a Raspberry Pi 4 running Raspberry Pi OS 64-bit.
I did not find much reliable information online.
First, I recompiled the kernel after enabling the "Memory Management options ---> Transparent Hugepage Support" option.
When I run the command:
grep -i huge /proc/meminfo
The output is:
AnonHugePages: 319488 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
and running the command:
cat /sys/kernel/mm/transparent_hugepage/enabled
the output is:
[always] madvise never
So I think Transparent Huge Pages (AnonHugePages) should be set.
I need to use HugePages to map the largest possible contiguous memory chunk using the mmap() function in C code.
mem = mmap(NULL,buf_size,PROT_READ|PROT_WRITE,MAP_SHARED,fd,0);
Looking at https://www.man7.org/linux/man-pages/man2/mmap.2.html, there are flags to manage huge pages: MAP_HUGETLB, plus MAP_HUGE_2MB and MAP_HUGE_1GB to select the huge page size.
My question is: To use HugePages should I map in this way?
mem = mmap(NULL, buf_size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_HUGETLB, fd, 0);
Kernel configuration:
CONFIG_SYS_SUPPORTS_HUGETLBFS=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_TRANSPARENT_HUGE_PAGECACHE=y
# CONFIG_HUGETLBFS is not set
Huge pages are a way to enhance the performance of applications by reducing the number of TLB misses. The mechanism coalesces contiguous standard physical pages (typically 4 KB) into a big one (e.g. 2 MB). Linux implements this feature in two flavors: transparent huge pages and explicit huge pages.
Transparent Huge Pages
Transparent huge pages (THP) are managed transparently by the kernel. User space applications have little direct control over them. The kernel does its best to allocate huge pages whenever possible, but it is not guaranteed. Moreover, THP may introduce overhead because an underlying "garbage collector" kernel daemon named khugepaged is in charge of coalescing physical pages into huge pages. This may consume CPU time, with undesirable effects on the performance of the running applications. In systems with time-critical applications, it is generally advised to deactivate THP.
THP can be disabled on the boot command line (cf. the end of this answer) or from the shell in sysfs:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
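When the mode is left at madvise, an application can still request THP for a specific anonymous region with madvise(MADV_HUGEPAGE). A minimal sketch (the region size is an arbitrary example):
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#define LEN (8 * 1024 * 1024)   /* arbitrary 8 MB anonymous region */

int main(void)
{
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap()");
        return 1;
    }
    /* Hint that this region is a good THP candidate */
    if (madvise(p, LEN, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");
    return 0;
}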
N.B.: Some interesting papers exist on the performance evaluation/issues of the THP:
Transparent Hugepages: measuring the performance impact;
Settling the Myth of Transparent HugePages for Databases.
Explicit huge pages
If huge pages are required at the application level (i.e. from user space), the HUGETLBFS kernel configuration option must be set to activate the hugetlbfs pseudo-filesystem (the menu in the kernel configurator is something like: "File systems" --> "Pseudo filesystems" --> "HugeTLB file system support"). In the kernel source tree, this parameter is in fs/Kconfig:
config HUGETLBFS
    bool "HugeTLB file system support"
    depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
               SYS_SUPPORTS_HUGETLBFS || BROKEN
    help
      hugetlbfs is a filesystem backing for HugeTLB pages, based on
      ramfs. For architectures that support it, say Y here and read
      <file:Documentation/admin-guide/mm/hugetlbpage.rst> for details.
      If unsure, say N.
For example, on an Ubuntu system, we can check:
$ cat /boot/config-5.4.0-53-generic | grep HUGETLBFS
CONFIG_HUGETLBFS=y
N.B.: On the Raspberry Pi, it is possible to make the kernel expose /proc/config.gz and do the same check with zcat. To do so, the configuration menu is: "General setup" --> "Kernel .config support" + "Enable access to .config through /proc/config.gz".
When this parameter is set, hugetlbfs pseudo-filesystem is added into the kernel build (cf. fs/Makefile):
obj-$(CONFIG_HUGETLBFS) += hugetlbfs/
The source code of hugetlbfs is located in fs/hugetlbfs/inode.c. At startup, the kernel will mount internal hugetlbfs file systems to support all the available huge page sizes for the architecture it is running on:
static int __init init_hugetlbfs_fs(void)
{
    struct vfsmount *mnt;
    struct hstate *h;
    int error;
    int i;

    if (!hugepages_supported()) {
        pr_info("disabling because there are no supported hugepage sizes\n");
        return -ENOTSUPP;
    }

    error = -ENOMEM;
    hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
                                    sizeof(struct hugetlbfs_inode_info),
                                    0, SLAB_ACCOUNT, init_once);
    if (hugetlbfs_inode_cachep == NULL)
        goto out;

    error = register_filesystem(&hugetlbfs_fs_type);
    if (error)
        goto out_free;

    /* default hstate mount is required */
    mnt = mount_one_hugetlbfs(&hstates[default_hstate_idx]);
    if (IS_ERR(mnt)) {
        error = PTR_ERR(mnt);
        goto out_unreg;
    }
    hugetlbfs_vfsmount[default_hstate_idx] = mnt;

    /* other hstates are optional */
    i = 0;
    for_each_hstate(h) {
        if (i == default_hstate_idx) {
            i++;
            continue;
        }

        mnt = mount_one_hugetlbfs(h);
        if (IS_ERR(mnt))
            hugetlbfs_vfsmount[i] = NULL;
        else
            hugetlbfs_vfsmount[i] = mnt;
        i++;
    }
    return 0;

out_unreg:
    (void)unregister_filesystem(&hugetlbfs_fs_type);
out_free:
    kmem_cache_destroy(hugetlbfs_inode_cachep);
out:
    return error;
}
A hugetlbfs file system is a sort of RAM file system into which the kernel creates files to back the memory regions mapped by the applications.
The needed number of huge pages can be reserved by writing it into /sys/kernel/mm/hugepages/hugepages-<hugepagesize>/nr_hugepages.
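For example, the reservation can also be done from a program running as root by writing into that file; a sketch assuming the default 2 MB huge page size (hence the hugepages-2048kB directory):
#include <stdio.h>

/* Reserve 'count' default-size (2 MB) huge pages; must run as root */
static int reserve_huge_pages(int count)
{
    FILE *f = fopen("/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages", "w");
    if (!f) {
        perror("fopen()");
        return -1;
    }
    fprintf(f, "%d\n", count);
    return (fclose(f) == 0) ? 0 : -1;
}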
Then, mmap() is able to map some part of the application address space onto huge pages. Here is an example showing how to do it:
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
#define HP_SIZE (2 * 1024 * 1024) // <-- Adjust to the supported HP size on your system

int main(void)
{
    char *addr;

    // Map a huge page
    addr = mmap(NULL, HP_SIZE, PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap()");
        return 1;
    }

    printf("Mapping located at address: %p\n", (void *)addr);

    pause();

    return 0;
}
In the preceding program, the memory pointed to by addr is backed by huge pages. Example of usage:
$ gcc alloc_hp.c -o alloc_hp
$ ./alloc_hp
mmap(): Cannot allocate memory
$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
0
$ sudo sh -c "echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"
$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
1
$ ./alloc_hp
Mapping located at address: 0x7f7ef6c00000
In another terminal, the process map can be observed to verify the page size of the mapping (the program is blocked in the pause() system call):
$ pidof alloc_hp
13009
$ cat /proc/13009/smaps
[...]
7f7ef6c00000-7f7ef6e00000 rw-s 00000000 00:0f 331939 /anon_hugepage (deleted)
Size: 2048 kB
KernelPageSize: 2048 kB <----- The page size is 2MB
MMUPageSize: 2048 kB
[...]
In the preceding map, the file name /anon_hugepage for the huge page region is generated internally by the kernel. It is marked deleted because the kernel has already unlinked the associated memory file, so it disappears as soon as there are no more references to it (e.g. when the calling process ends: the underlying file is closed upon exit(), the reference counter on the file drops to 0, and the removal completes).
Allocation of other huge page sizes
On the Raspberry Pi 4B, the default huge page size is 2 MB, but the board supports several other huge page sizes:
$ ls -l /sys/kernel/mm/hugepages
total 0
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-1048576kB
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-2048kB
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-32768kB
drwxr-xr-x 2 root root 0 Nov 23 14:58 hugepages-64kB
To use them, it is necessary to mount a hugetlbfs type file system corresponding to the size of the desired huge page. The kernel documentation provides details on the available mount options. For example, to mount a hugetlbfs file system on /mnt/huge with 8 Huge Pages of size 64KB, the command is:
mount -t hugetlbfs -o pagesize=64K,size=512K,min_size=512K none /mnt/huge
Then it is possible to map huge pages of 64 KB in a user program. The following program creates the /tmp/hpfs directory, on which it mounts a hugetlbfs file system sized for 4 huge pages of 64 KB. A file named memfile_01 is created there and extended to the size of 2 huge pages. The file is mapped into memory with the mmap() system call. The MAP_HUGETLB flag is not passed, because the provided file descriptor refers to a file created on a hugetlbfs filesystem. Then the program calls pause() to suspend its execution, so that some observations can be made in another terminal:
#include <sys/types.h>
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <fcntl.h>
#define ERR(fmt, ...) do {                                \
    fprintf(stderr,                                       \
            "ERROR#%s#%d: "fmt,                           \
            __FUNCTION__, __LINE__, ## __VA_ARGS__);      \
} while(0)

#define HP_SIZE   (64 * 1024)
#define HPFS_DIR  "/tmp/hpfs"
#define HPFS_SIZE (4 * HP_SIZE)

int main(void)
{
    void *addr;
    char mount_opts[256];
    int rc;
    int fd;

    rc = mkdir(HPFS_DIR, 0777);
    if (0 != rc && EEXIST != errno) {
        ERR("mkdir(): %m (%d)\n", errno);
        return 1;
    }

    snprintf(mount_opts, sizeof(mount_opts), "pagesize=%d,size=%d,min_size=%d",
             HP_SIZE, HPFS_SIZE, HP_SIZE);
    rc = mount("none", HPFS_DIR, "hugetlbfs", 0, mount_opts);
    if (0 != rc) {
        ERR("mount(): %m (%d)\n", errno);
        return 1;
    }

    fd = open(HPFS_DIR"/memfile_01", O_RDWR | O_CREAT, 0777);
    if (fd < 0) {
        ERR("open(%s): %m (%d)\n", "memfile_01", errno);
        return 1;
    }

    rc = ftruncate(fd, 2 * HP_SIZE);
    if (0 != rc) {
        ERR("ftruncate(): %m (%d)\n", errno);
        return 1;
    }

    addr = mmap(NULL, 2 * HP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (MAP_FAILED == addr) {
        ERR("mmap(): %m (%d)\n", errno);
        return 1;
    }

    // The file can be closed
    rc = close(fd);
    if (0 != rc) {
        ERR("close(%d): %m (%d)\n", fd, errno);
        return 1;
    }

    pause();

    return 0;
} // main
The preceding program must be run as root as it calls mount():
$ gcc mount_tlbfs.c -o mount_tlbfs
$ cat /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepages
0
$ sudo sh -c "echo 8 > /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepages"
$ cat /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepages
8
$ sudo ./mount_tlbfs
In another terminal, the /proc/[pid]/smaps file can be displayed to check the huge page allocation. As soon as the program writes into the huge pages, the Lazy allocation mechanism triggers the effective allocation of the huge pages.
Cf. this article for further details.
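For instance, adding a single write to the mapping right after the mmap() call in the preceding program is enough to fault the pages in (a fragment reusing addr and HP_SIZE from that program; it also needs #include <string.h>):
/* Touch the mapping: this faults the 2 huge pages in (lazy allocation) */
memset(addr, 0, 2 * HP_SIZE);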
Early reservation
Huge pages are made of consecutive physical memory pages. The reservation should be done early in system startup (especially on heavily loaded systems), as physical memory may become so fragmented that it is sometimes impossible to allocate huge pages afterward. To reserve them as early as possible, this can be done on the kernel boot command line:
hugepages=
[HW] Number of HugeTLB pages to allocate at boot.
If this follows hugepagesz (below), it specifies
the number of pages of hugepagesz to be allocated.
If this is the first HugeTLB parameter on the command
line, it specifies the number of pages to allocate for
the default huge page size. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: <integer>
hugepagesz=
[HW] The size of the HugeTLB pages. This is used in
conjunction with hugepages (above) to allocate huge
pages of a specific size at boot. The pair
hugepagesz=X hugepages=Y can be specified once for
each supported huge page size. Huge page sizes are
architecture dependent. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]
transparent_hugepage=
[KNL]
Format: [always|madvise|never]
Can be used to control the default behavior of the system
with respect to transparent hugepages.
See Documentation/admin-guide/mm/transhuge.rst
for more details.
On Raspberry Pi, the boot command line can typically be updated in /boot/cmdline.txt and the current boot command line used by the running kernel can be seen in /proc/cmdline.
N.B.:
This recipe is explained in more detail here and here.
There is a user space library called libhugetlbfs which offers a layer of abstraction on top of the kernel's hugetlbfs mechanism described here. It comes with library services like get_huge_pages() and accompanying tools like hugectl. The goal of this user space service is to map the heap and text+data segments of STATICALLY linked executables into huge pages (the mapping of dynamically linked programs is not supported). All of this relies on the kernel features described in this answer.
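For illustration, allocating a huge-page-backed buffer with libhugetlbfs looks roughly like the sketch below. The get_huge_pages()/free_huge_pages()/gethugepagesize() calls and the GHP_DEFAULT flag should be checked against the library's hugetlbfs.h header (they are recalled here from memory); build with -lhugetlbfs:
#include <stdio.h>
#include <hugetlbfs.h>   /* from libhugetlbfs */

int main(void)
{
    long sz = gethugepagesize();              /* default huge page size */
    void *p = get_huge_pages(sz, GHP_DEFAULT);

    if (!p) {
        fprintf(stderr, "get_huge_pages() failed (no reserved huge pages?)\n");
        return 1;
    }
    printf("Huge-page-backed buffer of %ld bytes at %p\n", sz, p);
    free_huge_pages(p);
    return 0;
}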
I'd like to verify on any given Linux machine whether PCI passthrough is supported. After a bit of googling, I found that I should rather check whether an IOMMU is supported, and I did so by running:
dmesg | grep IOMMU
If it supports IOMMU (and not IOMMUv2), I would get:
[ 0.000000] DMAR: IOMMU enabled
[ 0.049734] DMAR-IR: IOAPIC id 8 under DRHD base 0xfbffc000 IOMMU 0
[ 0.049735] DMAR-IR: IOAPIC id 9 under DRHD base 0xfbffc000 IOMMU 0
[ 1.286567] AMD IOMMUv2 driver by Joerg Roedel <jroedel@suse.de>
[ 1.286568] AMD IOMMUv2 functionality not available on this system
...where DMAR: IOMMU enabled is what I'm looking for.
Now, if the machine has been running for days without a reboot, that first message [ 0.000000] DMAR: IOMMU enabled might no longer appear in the output of the previous command, because the kernel ring buffer eventually wraps around.
Is there any way to check for IOMMU support when that message disappears from the log?
Since 2014, enabled IOMMUs have been registered in the /sys (sysfs) special file system as the iommu class (documented at ABI/testing/sysfs-class-iommu):
https://patchwork.kernel.org/patch/4345491/ "[2/3] iommu/intel: Make use of IOMMU sysfs support" - June 12, 2014
Register our DRHD IOMMUs, cross link devices, and provide a base set
of attributes for the IOMMU. ...
On a typical desktop system, this provides the following (pruned):
$ find /sys | grep dmar
/sys/devices/virtual/iommu/dmar0
...
/sys/class/iommu/dmar0
/sys/class/iommu/dmar1
The code is iommu_device_create (http://elixir.free-electrons.com/linux/v4.5/ident/iommu_device_create, around 4.5) or iommu_device_sysfs_add (http://elixir.free-electrons.com/linux/v4.11/ident/iommu_device_sysfs_add) in more recent kernels.
/*
 * Create an IOMMU device and return a pointer to it. IOMMU specific
 * attributes can be provided as an attribute group, allowing a unique
 * namespace per IOMMU type.
 */
struct device *iommu_device_create(struct device *parent, void *drvdata,
                                   const struct attribute_group **groups,
                                   const char *fmt, ...)
Registration is done only for enabled IOMMU. DMAR:
    if (intel_iommu_enabled) {
        iommu->iommu_dev = iommu_device_create(NULL, iommu,
                                               intel_iommu_groups,
                                               "%s", iommu->name);
AMD IOMMU:
static int iommu_init_pci(struct amd_iommu *iommu)
{
    ...
    if (!iommu->dev)
        return -ENODEV;
    ...
    iommu->iommu_dev = iommu_device_create(&iommu->dev->dev, iommu,
                                           amd_iommu_groups, "ivhd%d",
                                           iommu->index);
Intel:
int __init intel_iommu_init(void)
{
    ...
    pr_info("Intel(R) Virtualization Technology for Directed I/O\n");
    ...
    for_each_active_iommu(iommu, drhd)
        iommu->iommu_dev = iommu_device_create(NULL, iommu,
                                               intel_iommu_groups,
                                               "%s", iommu->name);
As of Linux kernel 4.11, iommu_device_sysfs_add is referenced in many IOMMU drivers, so checking /sys/class/iommu is a better (more universal) way to programmatically detect an enabled IOMMU than parsing dmesg output or searching /var/log/kern.log or /var/log/messages for driver-specific enable messages:
Referenced in 10 files:
drivers/iommu/amd_iommu_init.c, line 1640
drivers/iommu/arm-smmu-v3.c, line 2709
drivers/iommu/arm-smmu.c, line 2163
drivers/iommu/dmar.c, line 1083
drivers/iommu/exynos-iommu.c, line 623
drivers/iommu/intel-iommu.c, line 4878
drivers/iommu/iommu-sysfs.c, line 57
drivers/iommu/msm_iommu.c, line 797
drivers/iommu/mtk_iommu.c, line 581
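A small sketch of such a programmatic check, which simply looks for entries under /sys/class/iommu:
#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(void)
{
    int found = 0;
    DIR *d = opendir("/sys/class/iommu");
    struct dirent *e;

    if (d) {
        while ((e = readdir(d)) != NULL) {
            if (strcmp(e->d_name, ".") && strcmp(e->d_name, ".."))
                found++;        /* e.g. dmar0, ivhd0, ... */
        }
        closedir(d);
    }
    printf("IOMMU %s\n", found ? "enabled" : "not enabled (or class absent)");
    return found ? 0 : 1;
}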
I am implementing a container which is cloned with new namespaces, including the mount, PID and user namespaces. The first thing the child does is mount several important points such as /proc, /sys and /tmp using the mount system call.
if(::mount("proc", "/proc", "proc", 0, NULL)==-1) {
printf("Failed on mount: %s\n", strerror(errno));
return -1;
}
if(::mount("sysfs", "/sys", "sysfs", 0, NULL)==-1) {
printf("Failed on mount: %s\n", strerror(errno));
return -1;
}
if(::mount("tmp", "/tmp", "tmpfs", 0, NULL)==-1) {
printf("Failed on mount: %s\n", strerror(errno));
return -1;
}
However, I am a bit confused by the source field in the argument list passed to mount.
int mount(const char *source, const char *target,
const char *filesystemtype, unsigned long mountflags,
const void *data);
What does the source mean exactly? For example, mounting /tmp seems to have nothing to do with the source string: I can still see a new /tmp folder created under the new namespace even when using ::mount(nullptr, "/tmp", "tmpfs", 0, NULL). Am I missing something?
It is just supposed to match the arguments such as those provided in your /etc/fstab file. For instance, in my fstab I have:
# <file system> <mount point> <type> <options> <dump> <pass>
...
proc /proc proc defaults 0 0
sysfs /sys sysfs defaults 0 0
But those examples are a bit special because of their nature: neither proc nor sysfs is a general-purpose filesystem. If you were mounting a hard drive, the source would be more straightforward, being /dev/sda1 for instance.
And because you're implementing isolation on top of namespaces, beware of the container calling umount on /proc, for instance: it might reveal the host's /proc and thus break the isolation. One way to limit that risk is sketched below.
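One common mitigation (a sketch, not part of the question's code) is to mark the whole mount tree private in the new namespace before mounting anything, so mount and umount events do not propagate back to the host:
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mount.h>

/* Recursively mark every mount in this namespace as private */
static int make_mounts_private(void)
{
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
        printf("Failed to make mounts private: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}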
To add a bit to Aif's answer: according to the mount(2) man page:
mount() attaches the filesystem specified by source (which is often a
pathname referring to a device, but can also be the pathname of a
directory or file, or a dummy string) to the location (a directory or
file) specified by the pathname in target.
In the case of tmpfs, it is very much a dummy string. You are simply creating a temporary file system. tmpfs is stored in volatile memory and is temporary, not really having a source.
For other filesystem types, source will be very important, specifying which filesystem you are mounting to that directory, e.g. /dev/sda1 or what have you.
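To make the contrast concrete, here is a minimal sketch; the /dev/sda1 device, the /mnt target and the tmpfs size option are arbitrary examples:
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mount.h>

int main(void)
{
    /* tmpfs: the source ("none", "tmp", even NULL) is only a label */
    if (mount("none", "/tmp", "tmpfs", 0, "size=64m") == -1)
        printf("tmpfs mount failed: %s\n", strerror(errno));

    /* block-device-backed filesystem: the source selects the data to mount */
    if (mount("/dev/sda1", "/mnt", "ext4", MS_RDONLY, NULL) == -1)
        printf("ext4 mount failed: %s\n", strerror(errno));

    return 0;
}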
Does the Linux kernel always provide file descriptors 0, 1 and 2 for the PID 1 process, whether it is passed on boot with init=... or is implicitly /sbin/init (/etc/init, /bin/init, /bin/sh)? Do they refer to the system console /dev/console? What happens if /dev is not provided at init time but has to be set up by the init system?
They're hooked up to the console by kernel_init_freeable. The kernel opens /dev/console on the rootfs (a static node, no udev involved) and duplicates the resulting descriptor twice, so file descriptors 0, 1 and 2 all refer to the console.
/* Open the /dev/console on the rootfs, this should never fail */
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
    pr_err("Warning: unable to open an initial console.\n");

(void) sys_dup(0);
(void) sys_dup(0);
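From user space this can be verified after boot by resolving the symlinks under /proc/1/fd (a minimal sketch; reading PID 1's fd directory usually requires root):
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char path[64], target[256];

    for (int fd = 0; fd < 3; fd++) {
        snprintf(path, sizeof(path), "/proc/1/fd/%d", fd);
        ssize_t n = readlink(path, target, sizeof(target) - 1);
        if (n < 0) {
            perror(path);
            continue;
        }
        target[n] = '\0';
        printf("init fd %d -> %s\n", fd, target);   /* typically /dev/console */
    }
    return 0;
}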
One of the books on advanced Linux programming states:
The /proc/filesystems entry displays the file system types known to the kernel. Note that this list isn't very useful because it is not complete: File systems can be loaded and unloaded dynamically as kernel modules. The contents of /proc/filesystems list only file system types that either are statically linked into the kernel or are currently loaded. Other file system types may be available on the system as modules but might not be loaded yet.
Now, I have:
➜ ~ ps -C sshfs
PID TTY TIME CMD
8123 ? 00:00:00 sshfs
➜ ~ mount | grep sshfs
root@ss1: on /home/wani/tmp type fuse.sshfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0)
➜ ~
But ...
➜ ~ cat /proc/filesystems | grep sshfs
➜ ~
sshfs is implemented in userspace using the FUSE infrastructure. Userspace filesystems are not known to the kernel as a separate entity. The FUSE kernel-side infrastructure itself, however, is known to the kernel. On my system:
$ cat /proc/filesystems
nodev sysfs
nodev rootfs
nodev ramfs
...
ext4
cramfs
...
nodev fuse
nodev fusectl
...
Note the fuse and fusectl lines: the kernel is aware of a fuse filesystem, which is essentially an adapter interface that lets filesystem services be provided by userspace processes.
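A short sketch that checks whether a given type is currently known to the kernel by scanning /proc/filesystems (with the caveat quoted above: a missing entry may only mean the module is not loaded yet):
#include <stdio.h>
#include <string.h>

/* Returns 1 if 'type' appears in /proc/filesystems, 0 otherwise */
static int fs_type_known(const char *type)
{
    char line[128];
    FILE *f = fopen("/proc/filesystems", "r");
    int found = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        char *name = strrchr(line, '\t');   /* lines are "nodev\t<name>" or "\t<name>" */
        if (!name)
            continue;
        name++;
        name[strcspn(name, "\n")] = '\0';
        if (strcmp(name, type) == 0) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}

int main(void)
{
    printf("fuse:  %s\n", fs_type_known("fuse")  ? "known" : "unknown");
    printf("sshfs: %s\n", fs_type_known("sshfs") ? "known" : "unknown");
    return 0;
}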