Detect block size for quota in Linux

The limit placed on disk quota in Linux is counted in blocks. However, I found no reliable way to determine the block size. Tutorials I found refer to the block size as 512 bytes in some places and 1024 bytes in others.
I got confused reading a post on LinuxForum.org about what a block size really means, so I tried to pin down its meaning in the context of quota.
I found a "Determine the block size on hard disk filesystem for disk quota" tip on NixCraft, which suggested the command:
dumpe2fs /dev/sdXN | grep -i 'Block size'
or
blockdev --getbsz /dev/sdXN
But on my system those commands returned 4096, and when I checked the real quota block size on the same system, I got a block size of 1024 bytes.
Is there a scriptable way to determine the quota block size on a device, short of creating a file of known size and checking its quota usage?

The filesystem blocksize and the quota blocksize are potentially different. The quota blocksize is given by the BLOCK_SIZE macro defined in <sys/mount.h> (/usr/include/sys/mount.h):
#ifndef _SYS_MOUNT_H
#define _SYS_MOUNT_H 1
#include <features.h>
#include <sys/ioctl.h>
#define BLOCK_SIZE 1024
#define BLOCK_SIZE_BITS 10
...
The blocksize of a given filesystem is returned by the statvfs call:
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char *argv[])
{
    char *fn;
    struct statvfs vfs;

    if (argc > 1)
        fn = argv[1];
    else
        fn = argv[0];

    if (statvfs(fn, &vfs)) {
        perror("statvfs");
        return 1;
    }

    printf("(%s) bsize: %lu\n", fn, vfs.f_bsize);
    return 0;
}
The <sys/quota.h> header includes a convenience macro to convert filesystem blocks to disk quota blocks:
/*
* Convert count of filesystem blocks to diskquota blocks, meant
* for filesystems where i_blksize != BLOCK_SIZE
*/
#define fs_to_dq_blocks(num, blksize) (((num) * (blksize)) / BLOCK_SIZE)

Related

Why does making concurrent random writes to a single file on an NVMe SSD not lead to throughput increases?

I've been experimenting with a random write workload where I use multiple threads to write to disjoint offsets within one or more files on an NVMe SSD. I'm using a Linux machine and the writes are synchronous and are made using direct I/O (i.e., the files are opened with O_DSYNC and O_DIRECT).
I noticed that if the threads write concurrently to a single file, the achieved write throughput does not increase when the number of threads increases (i.e., the writes appear to be applied serially and not in parallel). However if each thread writes to its own file, I do get throughput increases (up to the SSD's manufacturer-advertised random write throughput). See the graph below for my throughput measurements.
I was wondering if anyone knows why I'm not able to get throughput increases if I have multiple threads concurrently writing to non-overlapping regions in the same file?
Here are some additional details about my experimental setup.
I'm writing 2 GiB of data (random write) and varying the number of threads used to do the write (from 1 to 16). Each thread writes 4 KiB of data at a time. I'm considering two setups: (1) all threads write to a single file, and (2) each thread writes to its own file. Before starting the benchmark, the file(s) used are opened and are initialized to their final size using fallocate(). The file(s) are opened with O_DIRECT and O_DSYNC. Each thread is assigned a random disjoint subset of the offsets within the file (i.e., the regions the threads write to are non-overlapping). Then, the threads concurrently write to these offsets using pwrite().
Here are the machine's specifications:
Linux 5.9.1-arch1-1
1 TB Intel NVMe SSD (model SSDPE2KX010T8)
ext4 file system
128 GiB of memory
2.10 GHz 20-core Xeon Gold 6230 CPU
The SSD is supposed to be capable of delivering up to 70000 IOPS of random writes.
I've included a standalone C++ program that I've used to reproduce this behavior on my machine. I've been compiling using g++ -O3 -lpthread <file> (I'm using g++ version 10.2.0).
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <random>
#include <string>
#include <thread>
#include <vector>

#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

constexpr size_t kBlockSize = 4 * 1024;
constexpr size_t kDataSizeMiB = 2048;
constexpr size_t kDataSize = kDataSizeMiB * 1024 * 1024;
constexpr size_t kBlocksTotal = kDataSize / kBlockSize;
constexpr size_t kRngSeed = 42;

void AllocFiles(unsigned num_files, size_t blocks_per_file,
                std::vector<int> &fds,
                std::vector<std::vector<size_t>> &write_pos) {
  std::mt19937 rng(kRngSeed);
  for (unsigned i = 0; i < num_files; ++i) {
    const std::string path = "f" + std::to_string(i);
    fds.push_back(open(path.c_str(), O_CREAT | O_WRONLY | O_DIRECT | O_DSYNC,
                       S_IRUSR | S_IWUSR));
    write_pos.emplace_back();
    auto &file_offsets = write_pos.back();
    int fd = fds.back();
    for (size_t blk = 0; blk < blocks_per_file; ++blk) {
      file_offsets.push_back(blk * kBlockSize);
    }
    fallocate(fd, /*mode=*/0, /*offset=*/0, blocks_per_file * kBlockSize);
    std::shuffle(file_offsets.begin(), file_offsets.end(), rng);
  }
}

void ThreadMain(int fd, const void *data, const std::vector<size_t> &write_pos,
                size_t offset, size_t num_writes) {
  for (size_t i = 0; i < num_writes; ++i) {
    pwrite(fd, data, kBlockSize, write_pos[i + offset]);
  }
}

int main(int argc, char *argv[]) {
  assert(argc == 3);
  unsigned num_threads = strtoul(argv[1], nullptr, 10);
  unsigned files = strtoul(argv[2], nullptr, 10);
  assert(num_threads % files == 0);
  assert(num_threads >= files);
  assert(kBlocksTotal % num_threads == 0);

  void *data_buf;
  posix_memalign(&data_buf, 512, kBlockSize);
  *reinterpret_cast<uint64_t *>(data_buf) = 0xFFFFFFFFFFFFFFFF;

  std::vector<int> fds;
  std::vector<std::vector<size_t>> write_pos;
  std::vector<std::thread> threads;

  const size_t blocks_per_file = kBlocksTotal / files;
  const unsigned threads_per_file = num_threads / files;
  const unsigned writes_per_thread_per_file =
      blocks_per_file / threads_per_file;
  AllocFiles(files, blocks_per_file, fds, write_pos);

  const auto begin = std::chrono::steady_clock::now();
  for (unsigned thread_id = 0; thread_id < num_threads; ++thread_id) {
    unsigned thread_file_offset = thread_id / files;
    threads.emplace_back(
        &ThreadMain, fds[thread_id % files], data_buf,
        write_pos[thread_id % files],
        /*offset=*/(thread_file_offset * writes_per_thread_per_file),
        /*num_writes=*/writes_per_thread_per_file);
  }
  for (auto &thread : threads) {
    thread.join();
  }
  const auto end = std::chrono::steady_clock::now();

  for (const auto &fd : fds) {
    close(fd);
  }
  // throughput in MiB/s: data written divided by elapsed seconds
  std::cout << kDataSizeMiB /
                   std::chrono::duration_cast<std::chrono::duration<double>>(
                       end - begin)
                       .count()
            << std::endl;
  free(data_buf);
  return 0;
}
In this scenario, the underlying reason was that ext4 was taking an exclusive lock when writing to the file. To get the multithreaded throughput scaling that we would expect when writing to the same file, I needed to make two changes:
The file needed to be "preallocated." This means that we need to make at least one actual write to every block in the file that we plan on writing to (e.g., writing zeros to the whole file).
The buffer used for making the write needs to be aligned to the file system's block size. In my case the buffer should have been aligned to 4096.
// What I had
posix_memalign(&data_buf, 512, kBlockSize);
// What I actually needed
posix_memalign(&data_buf, 4096, kBlockSize);
With these changes, using multiple threads to make non-overlapping random writes to a single file leads to the same throughput gains as if the threads each wrote to their own file.

munmap() failure with ENOMEM with private anonymous mapping

I have recently discovered that Linux does not guarantee that memory allocated with mmap can be freed with munmap if doing so would push the number of VMA (virtual memory area) structures past vm.max_map_count. The manpage states this (almost) clearly:
ENOMEM The process's maximum number of mappings would have been exceeded.
This error can also occur for munmap(), when unmapping a region
in the middle of an existing mapping, since this results in two
smaller mappings on either side of the region being unmapped.
The problem is that the Linux kernel always tries to merge VMA structures when possible, which makes munmap fail even for separately created mappings. I was able to write a small program that confirms this behavior:
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>

// value of vm.max_map_count
#define VM_MAX_MAP_COUNT (65530)

// number of VMAs for an empty process linked against libc - see /proc/<pid>/maps
#define VMA_PREMAPPED (15)

#define VMA_SIZE (4096)
#define VMA_COUNT ((VM_MAX_MAP_COUNT - VMA_PREMAPPED) * 2)

int main(void)
{
    static void *vma[VMA_COUNT];

    for (int i = 0; i < VMA_COUNT; i++) {
        vma[i] = mmap(0, VMA_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if (vma[i] == MAP_FAILED) {
            printf("mmap() failed at %d\n", i);
            return 1;
        }
    }

    for (int i = 0; i < VMA_COUNT; i += 2) {
        if (munmap(vma[i], VMA_SIZE) != 0) {
            printf("munmap() failed at %d (%p): %m\n", i, vma[i]);
        }
    }
}
It allocates a large number of pages (twice the default allowed maximum) using mmap, then munmaps every second page to create a separate VMA structure for each remaining page. On my machine the last munmap call always fails with ENOMEM.
Initially I thought that munmap never fails if used with the same values for address and size that were used to create mapping. Apparently this is not the case on Linux and I was not able to find information about similar behavior on other systems.
At the same time, partial unmapping applied to the middle of a mapped region could, in my opinion, be expected to fail on any OS with a sane implementation, but I haven't found any documentation that says such failure is possible.
I would usually consider this a bug in the kernel, but knowing how Linux deals with memory overcommit and OOM I am almost sure this is a "feature" that exists to improve performance and decrease memory consumption.
Other information I was able to find:
Similar APIs on Windows do not have this "feature" due to their design (see MapViewOfFile, UnmapViewOfFile, VirtualAlloc, VirtualFree) - they simply do not support partial unmapping.
glibc malloc implementation does not create more than 65535 mappings, backing off to sbrk when this limit is reached: https://code.woboq.org/userspace/glibc/malloc/malloc.c.html. This looks like a workaround for this issue, but it is still possible to make free silently leak memory.
jemalloc had trouble with this and tried to avoid using mmap/munmap because of this issue (I don't know how it ended for them).
Do other OS's really guarantee deallocation of memory mappings? I know Windows does this, but what about other Unix-like operating systems? FreeBSD? QNX?
EDIT: I am adding an example that shows how glibc's free can leak memory when an internal munmap call fails with ENOMEM. Use strace to see that munmap fails:
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>

// value of vm.max_map_count
#define VM_MAX_MAP_COUNT (65530)

#define VMA_MMAP_SIZE (4096)
#define VMA_MMAP_COUNT (VM_MAX_MAP_COUNT)

// glibc's malloc default mmap_threshold is 128 KiB
#define VMA_MALLOC_SIZE (128 * 1024)
#define VMA_MALLOC_COUNT (VM_MAX_MAP_COUNT)

int main(void)
{
    static void *mmap_vma[VMA_MMAP_COUNT];

    for (int i = 0; i < VMA_MMAP_COUNT; i++) {
        mmap_vma[i] = mmap(0, VMA_MMAP_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if (mmap_vma[i] == MAP_FAILED) {
            printf("mmap() failed at %d\n", i);
            return 1;
        }
    }

    for (int i = 0; i < VMA_MMAP_COUNT; i += 2) {
        if (munmap(mmap_vma[i], VMA_MMAP_SIZE) != 0) {
            printf("munmap() failed at %d (%p): %m\n", i, mmap_vma[i]);
            return 1;
        }
    }

    static void *malloc_vma[VMA_MALLOC_COUNT];

    for (int i = 0; i < VMA_MALLOC_COUNT; i++) {
        malloc_vma[i] = malloc(VMA_MALLOC_SIZE);
        if (malloc_vma[i] == NULL) {
            printf("malloc() failed at %d\n", i);
            return 1;
        }
    }

    for (int i = 0; i < VMA_MALLOC_COUNT; i += 2) {
        free(malloc_vma[i]);
    }
}
One way to work around this problem on Linux is to mmap more than one page at once (e.g. 1 MB at a time), and also map a separator page after it. So you actually call mmap on 257 pages of memory, then set the last page to PROT_NONE so that it cannot be accessed. This defeats the VMA merging optimization in the kernel. Since you are allocating many pages at once, you should not run into the max mapping limit. The downside is that you have to manually manage how you want to slice up the large mmap.
As to your questions:
System calls can fail on any system for a variety of reasons. Documentation is not always complete.
You are allowed to munmap part of an mmap'd region as long as the address passed in lies on a page boundary and the length argument is rounded up to the next multiple of the page size.

Linux file descriptor table and vmalloc

I see that the Linux kernel uses vmalloc to allocate memory for the fdtable when it's bigger than a certain threshold. I would like to know exactly when this happens.
static void *alloc_fdmem(size_t size)
{
    /*
     * Very large allocations can stress page reclaim, so fall back to
     * vmalloc() if the allocation size will be considered "large" by the VM.
     */
    if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
        void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
        if (data != NULL)
            return data;
    }
    return vmalloc(size);
}
alloc_fdmem is called from alloc_fdtable, which in turn is called from expand_fdtable.
I wrote this code to print the threshold size:
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3
#define PAGE_SIZE 4096

int main(void)
{
    printf("\t%d\n", PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
}
Output
./printo
32768
So, how many files does it take for the kernel to switch to using vmalloc to allocate fdtable?
So PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER is 32768
This is called like:
data = alloc_fdmem(nr * sizeof(struct file *));
i.e. it's used to store struct file pointers.
If your pointers are 4 bytes, the switch happens above 32768/4 = 8192 open files; if your pointers are 8 bytes, it happens above 32768/8 = 4096 open files.

malloc large memory never returns NULL

When I run this, it seems to have no problem allocating memory, with cnt going over thousands. I don't understand why -- aren't I supposed to get a NULL at some point? Thanks!
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

int main(void)
{
    long C = pow(10, 9);
    int cnt = 0;
    int conversion = 8 * 1024 * 1024;
    int *p;

    while (1)
    {
        p = (int *)malloc(C * sizeof(int));
        if (p != NULL)
            cnt++;
        else
            break;
        if (cnt % 10 == 0)
            printf("number of successful malloc is %d with %ld Mb\n", cnt, cnt * C / conversion);
    }
    return 0;
}
Are you running this on Linux? Linux has a highly surprising feature known as overcommit. It doesn't actually allocate memory when you call malloc(), but rather when you actually use that memory. malloc() will happily let you allocate as much memory as your heart desires, never returning a NULL pointer.
It's only when you actually access the memory that Linux takes you seriously and goes looking for free memory to give you. Of course there may not actually be enough memory to keep the promise it made your program. You say, "Give me 8GB," and malloc() says, "Sure." Then you try to write through your pointer and Linux says, "Oops! I lied. How about I just kill off processes (probably yours) until I free up enough memory?"
You're allocating virtual memory. On a 64-bit OS, virtual memory is available in almost unlimited supply.

Create sparse file with alternate data and hole on ext3 and XFS

I wrote a program that creates a sparse file containing alternating empty blocks and data blocks.
For example, block1=empty, block2=data, block3=empty, and so on.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

void *buf;

int main(int argc, char **argv)
{
    int fd = 0;
    int i;
    int sector_per_block = BLOCK_SIZE / 512;
    int block_count = 4;

    buf = malloc(512);
    memset(buf, 'a', 512);   /* fill with the character 'a' */

    if (argc != 2) {
        printf("Wrong usage\n USAGE: program absolute_path_to_write\n");
        _exit(-1);
    }
    fd = open(argv[1], O_RDWR | O_CREAT, 0666);
    if (fd < 0) {
        printf("file open failed\n");
        _exit(0);
    }
    /* each iteration skips one block (leaving a hole), then writes one
     * block of data sector by sector */
    while (block_count > 0) {
        lseek(fd, BLOCK_SIZE, SEEK_CUR);
        block_count--;
        for (i = 0; i < sector_per_block; i++)
            write(fd, buf, 512);
        block_count--;
    }
    close(fd);
    return 0;
}
Suppose, I create a new_sparse_file using this above code.
When I run this program on an ext3 FS with a 4 KB block size, ls -lh shows the size of new_sparse_file as 16 KB, while du -h shows 8 KB, which I think is correct.
On XFS, also with a 4 KB block size, ls -lh shows 16 KB but du -h shows 12 KB.
Why are there different kinds of behavior?
This is a bug in XFS:
http://oss.sgi.com/archives/xfs/2011-06/msg00225.html
Sparse files are just a space optimization, and any FS may choose not to create a sparse file, to reduce fragmentation and improve access speed. So you can't depend on how sparse the file will be on a given FS.
So, 12kb on XFS is correct too.
