Why does making concurrent random writes to a single file on an NVMe SSD not lead to throughput increases?

I've been experimenting with a random write workload where I use multiple threads to write to disjoint offsets within one or more files on an NVMe SSD. I'm using a Linux machine and the writes are synchronous and are made using direct I/O (i.e., the files are opened with O_DSYNC and O_DIRECT).
I noticed that if the threads write concurrently to a single file, the achieved write throughput does not increase when the number of threads increases (i.e., the writes appear to be applied serially and not in parallel). However if each thread writes to its own file, I do get throughput increases (up to the SSD's manufacturer-advertised random write throughput). See the graph below for my throughput measurements.
I was wondering if anyone knows why I'm not able to get throughput increases if I have multiple threads concurrently writing to non-overlapping regions in the same file?
Here are some additional details about my experimental setup.
I'm writing 2 GiB of data (random write) and varying the number of threads used to do the write (from 1 to 16). Each thread writes 4 KiB of data at a time. I'm considering two setups: (1) all threads write to a single file, and (2) each thread writes to its own file. Before starting the benchmark, the file(s) used are opened and are initialized to their final size using fallocate(). The file(s) are opened with O_DIRECT and O_DSYNC. Each thread is assigned a random disjoint subset of the offsets within the file (i.e., the regions the threads write to are non-overlapping). Then, the threads concurrently write to these offsets using pwrite().
Here are the machine's specifications:
Linux 5.9.1-arch1-1
1 TB Intel NVMe SSD (model SSDPE2KX010T8)
ext4 file system
128 GiB of memory
2.10 GHz 20-core Xeon Gold 6230 CPU
The SSD is supposed to be capable of delivering up to 70000 IOPS of random writes.
I've included a standalone C++ program that I've used to reproduce this behavior on my machine. I've been compiling using g++ -O3 -lpthread <file> (I'm using g++ version 10.2.0).
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <random>
#include <string>
#include <thread>
#include <vector>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
constexpr size_t kBlockSize = 4 * 1024;
constexpr size_t kDataSizeMiB = 2048;
constexpr size_t kDataSize = kDataSizeMiB * 1024 * 1024;
constexpr size_t kBlocksTotal = kDataSize / kBlockSize;
constexpr size_t kRngSeed = 42;
void AllocFiles(unsigned num_files, size_t blocks_per_file,
std::vector<int> &fds,
std::vector<std::vector<size_t>> &write_pos) {
std::mt19937 rng(kRngSeed);
for (unsigned i = 0; i < num_files; ++i) {
const std::string path = "f" + std::to_string(i);
fds.push_back(open(path.c_str(), O_CREAT | O_WRONLY | O_DIRECT | O_DSYNC,
S_IRUSR | S_IWUSR));
write_pos.emplace_back();
auto &file_offsets = write_pos.back();
int fd = fds.back();
for (size_t blk = 0; blk < blocks_per_file; ++blk) {
file_offsets.push_back(blk * kBlockSize);
}
fallocate(fd, /*mode=*/0, /*offset=*/0, blocks_per_file * kBlockSize);
std::shuffle(file_offsets.begin(), file_offsets.end(), rng);
}
}
void ThreadMain(int fd, const void *data, const std::vector<size_t> &write_pos,
size_t offset, size_t num_writes) {
for (size_t i = 0; i < num_writes; ++i) {
pwrite(fd, data, kBlockSize, write_pos[i + offset]);
}
}
int main(int argc, char *argv[]) {
assert(argc == 3);
unsigned num_threads = strtoul(argv[1], nullptr, 10);
unsigned files = strtoul(argv[2], nullptr, 10);
assert(num_threads % files == 0);
assert(num_threads >= files);
assert(kBlocksTotal % num_threads == 0);
void *data_buf;
posix_memalign(&data_buf, 512, kBlockSize);
*reinterpret_cast<uint64_t *>(data_buf) = 0xFFFFFFFFFFFFFFFF;
std::vector<int> fds;
std::vector<std::vector<size_t>> write_pos;
std::vector<std::thread> threads;
const size_t blocks_per_file = kBlocksTotal / files;
const unsigned threads_per_file = num_threads / files;
const unsigned writes_per_thread_per_file =
blocks_per_file / threads_per_file;
AllocFiles(files, blocks_per_file, fds, write_pos);
const auto begin = std::chrono::steady_clock::now();
for (unsigned thread_id = 0; thread_id < num_threads; ++thread_id) {
unsigned thread_file_offset = thread_id / files;
threads.emplace_back(
&ThreadMain, fds[thread_id % files], data_buf,
write_pos[thread_id % files],
/*offset=*/(thread_file_offset * writes_per_thread_per_file),
/*num_writes=*/writes_per_thread_per_file);
}
for (auto &thread : threads) {
thread.join();
}
const auto end = std::chrono::steady_clock::now();
for (const auto &fd : fds) {
close(fd);
}
std::cout << kDataSizeMiB /
std::chrono::duration_cast<std::chrono::duration<double>>(
end - begin)
.count()
<< std::endl;
free(data_buf);
return 0;
}

In this scenario, the underlying reason was that ext4 was taking an exclusive lock when writing to the file. To get the multithreaded throughput scaling that we would expect when writing to the same file, I needed to make two changes:
The file needed to be "preallocated." This means that we need to make at least one actual write to every block in the file that we plan on writing to (e.g., writing zeros to the whole file).
The buffer used for making the write needs to be aligned to the file system's block size. In my case the buffer should have been aligned to 4096.
// What I had
posix_memalign(&data_buf, 512, kBlockSize);
// What I actually needed
posix_memalign(&data_buf, 4096, kBlockSize);
With these changes, using multiple threads to make non-overlapping random writes to a single file leads to the same throughput gains as if the threads each wrote to their own file.

Related

pthreads code not scaling up

I wrote the following very simple pthread code to test how it scales up. I am running the code on a machine with 8 logical processors, and at no time do I create more than 8 threads (to avoid context switching).
As the number of threads increases, each thread has less work to do. Also, it is evident from the code that the threads share no data structures that might be a bottleneck. But still, my performance degrades as I increase the number of threads.
Can somebody tell me what I am doing wrong here?
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int NUM_THREADS = 3;
unsigned long int COUNTER = 10000000000000;
unsigned long int LOOP_INDEX;
void* addNum(void *data)
{
unsigned long int sum = 0;
for(unsigned long int i = 0; i < LOOP_INDEX; i++) {
sum += 100;
}
return NULL;
}
int main(int argc, char** argv)
{
NUM_THREADS = atoi(argv[1]);
pthread_t *threads = (pthread_t*)malloc(sizeof(pthread_t) * NUM_THREADS);
int rc;
clock_t start, diff;
LOOP_INDEX = COUNTER/NUM_THREADS;
start = clock();
for (int t = 0; t < NUM_THREADS; t++) {
rc = pthread_create((threads + t), NULL, addNum, NULL);
if (rc) {
printf("ERROR; return code from pthread_create() is %d", rc);
exit(-1);
}
}
void *status;
for (int t = 0; t < NUM_THREADS; t++) {
rc = pthread_join(threads[t], &status);
}
diff = clock() - start;
int sec = diff / CLOCKS_PER_SEC;
printf("%d",sec);
}
Note: All the answers I found online said that the overhead of creating the threads is more than the work they are doing. To test it, I commented out everything in the "addNum()" function. But then, after doing that no matter how many threads I create, the time taken by the code is 0 seconds. So there is no overhead as such, I think.

clock() counts CPU time used, across all threads. So all that's telling you is that you're using a little bit more total CPU time, which is exactly what you would expect.
It's the total wall clock elapsed time which should be going down if your parallelisation is effective. Measure that with clock_gettime() specifying the CLOCK_MONOTONIC clock instead of clock().
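To make the difference concrete, here is a small sketch that records both measurements around the same threaded work. It is written in C++ and uses std::chrono::steady_clock as a monotonic wall clock in place of calling clock_gettime() directly; the names Timing, Spin and MeasureSpin are invented for this example.

```cpp
#include <chrono>
#include <ctime>
#include <thread>
#include <vector>

struct Timing {
  double cpu_seconds;   // from clock(): CPU time summed over all threads
  double wall_seconds;  // from steady_clock: elapsed real time
};

// Busy loop; volatile keeps the compiler from optimizing it away.
void Spin(unsigned long iterations) {
  volatile unsigned long sum = 0;
  for (unsigned long i = 0; i < iterations; ++i) sum = sum + 100;
}

// Splits total_iterations across num_threads and times the whole run
// with both clocks.
Timing MeasureSpin(unsigned num_threads, unsigned long total_iterations) {
  const std::clock_t cpu_begin = std::clock();
  const auto wall_begin = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < num_threads; ++t)
    threads.emplace_back(Spin, total_iterations / num_threads);
  for (auto &th : threads) th.join();
  const auto wall_end = std::chrono::steady_clock::now();
  const std::clock_t cpu_end = std::clock();
  return {static_cast<double>(cpu_end - cpu_begin) / CLOCKS_PER_SEC,
          std::chrono::duration<double>(wall_end - wall_begin).count()};
}
```

If the work parallelizes well, wall_seconds should shrink as num_threads grows while cpu_seconds stays roughly constant or rises slightly.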

Why does my process take too long to die?

Basically I'm using Linux 2.6.34 on PowerPC (Freescale e500mc). I have a process (a kind of VM that was developed in-house) that uses about 2.25 G of mlocked VM. When I kill it, I notice that it takes upwards of 2 minutes to terminate.
I investigated a little. First, I closed all open file descriptors but that didn't seem to make a difference. Then I added some printk in the kernel and through it I found that all delay comes from the kernel unlocking my VMAs. The delay is uniform across pages, which I verified by repeatedly checking the locked page count in /proc/meminfo. I've checked with programs that allocate that much memory and they all die as soon as I signal them.
What do you think I should check now? Thanks for your replies.
Edit: I had to find a way to share more information about the problem, so I wrote the program below:
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <sys/time.h>
#define MAP_PERM_1 (PROT_WRITE | PROT_READ | PROT_EXEC)
#define MAP_PERM_2 (PROT_WRITE | PROT_READ)
#define MAP_FLAGS (MAP_ANONYMOUS | MAP_FIXED | MAP_PRIVATE)
#define PG_LEN 4096
#define align_pg_32(addr) (addr & 0xFFFFF000)
#define num_pg_in_range(start, end) ((end - start + 1) >> 12)
inline void __force_pgtbl_alloc(unsigned int start)
{
volatile int *s = (int *) start;
*s = *s;
}
int __map_a_page_at(unsigned int start, int whichperm)
{
int perm = whichperm ? MAP_PERM_1 : MAP_PERM_2;
if(MAP_FAILED == mmap((void *)start, PG_LEN, perm, MAP_FLAGS, 0, 0)){
fprintf(stderr,
"mmap failed at 0x%x: %s.\n",
start, strerror(errno));
return 0;
}
return 1;
}
int __mlock_page(unsigned int addr)
{
if (mlock((void *)addr, (size_t)PG_LEN) < 0){
fprintf(stderr,
"mlock failed on page: 0x%x: %s.\n",
addr, strerror(errno));
return 0;
}
return 1;
}
void sigint_handler(int p)
{
struct timeval start = {0 ,0}, end = {0, 0}, diff = {0, 0};
gettimeofday(&start, NULL);
munlockall();
gettimeofday(&end, NULL);
timersub(&end, &start, &diff);
printf("Munlock'd entire VM in %u secs %u usecs.\n",
diff.tv_sec, diff.tv_usec);
exit(0);
}
int make_vma_map(unsigned int start, unsigned int end)
{
int num_pg = num_pg_in_range(start, end);
if (end < start){
fprintf(stderr,
"Bad range: start: 0x%x end: 0x%x.\n",
start, end);
return 0;
}
for (; num_pg; num_pg --, start += PG_LEN){
if (__map_a_page_at(start, num_pg % 2) && __mlock_page(start))
__force_pgtbl_alloc(start);
else
return 0;
}
return 1;
}
void display_banner()
{
printf("-----------------------------------------\n");
printf("Virtual memory allocator. Ctrl+C to exit.\n");
printf("-----------------------------------------\n");
}
int main()
{
unsigned int vma_start, vma_end, input = 0;
int start_end = 0; // 0: start; 1: end;
display_banner();
// Bind SIGINT handler.
signal(SIGINT, sigint_handler);
while (1){
if (!start_end)
printf("start:\t");
else
printf("end:\t");
scanf("%i", &input);
if (start_end){
vma_end = align_pg_32(input);
make_vma_map(vma_start, vma_end);
}
else{
vma_start = align_pg_32(input);
}
start_end = !start_end;
}
return 0;
}
As you can see, the program accepts ranges of virtual addresses, each range being defined by a start and an end. Each range is then further subdivided into page-sized VMAs by giving different permissions to adjacent pages. Interrupting the program (using SIGINT) triggers a call to munlockall(), and the time this call takes to complete is printed.
Now, when I run it on a Freescale e500mc with Linux 2.6.34 over the range 0x30000000-0x35000000, I get a total munlockall() time of almost 45 seconds. However, if I do the same thing with smaller start-end ranges in random order (that is, not necessarily increasing addresses) such that the total number of pages (and locked VMAs) is roughly the same, I observe a total munlockall() time of no more than 4 seconds.
I tried the same thing on x86_64 with Linux 2.6.34 and my program compiled with -m32. The variation is less pronounced than on ppc, but it is still there: 8 seconds for the first case and under a second for the second case.
I also tried the program on Linux 2.6.10 at one end and on 3.19 at the other, and these large differences don't exist there. What's more, munlockall() always completes in under a second.
So, it seems that the problem, whatever it is, exists only around the 2.6.34 version of the Linux kernel.

You said the VM was developed in-house. Does this mean you have access to the source? I would start by checking to see if it has anything to stop it from immediately terminating to avoid data loss.
Otherwise, could you provide more information? You may also want to check out https://unix.stackexchange.com/, as they would be better suited to help with any issues the Linux kernel may be having.

How to change kernel timer frequency?

I wanted to change the kernel timer frequency option.
I found this, which says that I can change the configuration via /boot/config-$(uname -r).
(I also found a post saying that I couldn't change the timer frequency unless the kernel is built tickless, i.e. with CONFIG_NO_HZ=y, but mine is already set to CONFIG_NO_HZ=y.)
It also mentions how to calculate the frequency with C code.
So first I checked the current kernel timer frequency with the C code.
The result is 1000~1500 Hz.
Then I checked /boot/config-$(uname -r), which shows the following.
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
But the timer frequency there is 250 Hz...?
Then, to check further, I modified the file to
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
I rebooted, checked the config file again to confirm that the change was applied, and ran the C code, which measures the timer frequency approximately.
But the result was the same as before.
What is the problem?
My environment is Ubuntu 12.04 on VMware.
Below is the C code.
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define USECREQ 250
#define LOOPS 1000
void event_handler (int signum)
{
static unsigned long cnt = 0;
static struct timeval tsFirst;
if (cnt == 0) {
gettimeofday (&tsFirst, 0);
}
cnt ++;
if (cnt >= LOOPS) {
struct timeval tsNow;
struct timeval diff;
setitimer (ITIMER_REAL, NULL, NULL);
gettimeofday (&tsNow, 0);
timersub(&tsNow, &tsFirst, &diff);
unsigned long long udiff = (diff.tv_sec * 1000000) + diff.tv_usec;
double delta = (double)(udiff/cnt)/1000000;
int hz = (unsigned)(1.0/delta);
printf ("kernel timer interrupt frequency is approx. %d Hz", hz);
if (hz >= (int) (1.0/((double)(USECREQ)/1000000))) {
printf (" or higher");
}
printf ("\n");
exit (0);
}
}
int main (int argc, char **argv)
{
struct sigaction sa;
struct itimerval timer;
memset (&sa, 0, sizeof (sa));
sa.sa_handler = &event_handler;
sigaction (SIGALRM, &sa, NULL);
timer.it_value.tv_sec = 0;
timer.it_value.tv_usec = USECREQ;
timer.it_interval.tv_sec = 0;
timer.it_interval.tv_usec = USECREQ;
setitimer (ITIMER_REAL, &timer, NULL);
while (1);
}

Changes you make to /boot/config do not affect the running kernel. Please read more about the kernel config file here.
The config file you see in /boot/config (actually, it's more like config-[kernel_version]) is the config file that was USED to build the kernel. This means that every change you make to this config file does not affect anything.
To really make these changes you need to construct a new config file, with the modifications you require and compile and install a new kernel based on that config file. You can use the config file from /boot and just make the clock adjustments to fit.

Non collective write using in file view

When writing blocks to a file, with the blocks unevenly distributed across my processes, one can use MPI_File_write_at with the right offset. As this function is not a collective operation, this works well.
Example:
#include <cstdio>
#include <cstdlib>
#include <string>
#include <mpi.h>
int main(int argc, char* argv[])
{
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int global = 7; // prime helps have unbalanced procs
int local = (global/size) + (global%size>rank?1:0);
int strsize = 5;
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "output.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
for (int i=0; i<local; ++i)
{
size_t idx = i * size + rank;
std::string buffer = std::string(strsize, 'a' + idx);
size_t offset = buffer.size() * idx;
MPI_File_write_at(fh, offset, buffer.c_str(), buffer.size(), MPI_CHAR, MPI_STATUS_IGNORE);
}
MPI_File_close(&fh);
MPI_Finalize();
return 0;
}
However, for more complex writes, particularly when writing multidimensional data like raw images, one may want to create a view on the file with MPI_Type_create_subarray. When I use this method with a simple MPI_File_write (which is supposed to be non-collective), I run into deadlocks. Example:
#include <cstdio>
#include <cstdlib>
#include <string>
#include <mpi.h>
int main(int argc, char* argv[])
{
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int global = 7; // prime helps have unbalanced procs
int local = (global/size) + (global%size>rank?1:0);
int strsize = 5;
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "output.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
for (int i=0; i<local; ++i)
{
size_t idx = i * size + rank;
std::string buffer = std::string(strsize, 'a' + idx);
int dim = 2;
int gsizes[2] = { buffer.size(), global };
int lsizes[2] = { buffer.size(), 1 };
int offset[2] = { 0, idx };
MPI_Datatype filetype;
MPI_Type_create_subarray(dim, gsizes, lsizes, offset, MPI_ORDER_C, MPI_CHAR, &filetype);
MPI_Type_commit(&filetype);
MPI_File_set_view(fh, 0, MPI_CHAR, filetype, "native", MPI_INFO_NULL);
MPI_File_write(fh, buffer.c_str(), buffer.size(), MPI_CHAR, MPI_STATUS_IGNORE);
}
MPI_File_close(&fh);
MPI_Finalize();
return 0;
}
How can I keep such code from deadlocking? Keep in mind that my real code will really use the multidimensional capabilities of MPI_Type_create_subarray and cannot just use MPI_File_write_at.
Also, it is difficult for me to know the maximum number of blocks in a process, so I'd like to avoid doing a reduce_all and then looping up to the max number of blocks with empty writes when localnb <= id < maxnb.

You don't use MPI_REDUCE when you have a variable number of blocks per node. You use MPI_SCAN or MPI_EXSCAN; see "MPI IO Writing a file when offset is not known".
MPI_File_set_view is collective, so if 'local' is different on each processor, you'll find yourself calling a collective routine from fewer than all the processes in the communicator. If you really, really need to do so, open the file with MPI_COMM_SELF.
The MPI_SCAN approach means each process can set the file view as needed, and then blammo you can call the collective MPI_File_write_at_all (even if some processes have zero work, they still need to participate) and take advantage of whatever clever optimizations your MPI-IO implementation provides.
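To show what the exclusive scan gives you, here is a serial stand-in written without MPI so it can run anywhere. The function name ExclusiveScan is made up for this sketch; in real code each rank would pass its own byte count to MPI_Exscan with MPI_SUM and receive only its own offset (note that the result on rank 0 is undefined in MPI, so rank 0 has to set its offset to 0 itself). The layout it produces is contiguous per rank, which is not the interleaved layout used in the question.

```cpp
#include <cstddef>
#include <vector>

// Serial stand-in for MPI_Exscan with MPI_SUM: element r of the result is the
// sum of counts[0..r-1], i.e. the byte offset where rank r would start writing
// if every rank writes its data contiguously after the previous rank.
std::vector<std::size_t> ExclusiveScan(const std::vector<std::size_t> &counts) {
  std::vector<std::size_t> offsets(counts.size(), 0);
  std::size_t running = 0;
  for (std::size_t r = 0; r < counts.size(); ++r) {
    offsets[r] = running;   // everything written by ranks before r
    running += counts[r];   // add rank r's own contribution
  }
  return offsets;
}
```

With the numbers from the question (7 blocks of 5 bytes over 3 ranks, so 3, 2 and 2 blocks), the per-rank byte counts are 15, 10 and 10, and the resulting start offsets are 0, 15 and 25.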

Getting stack traces on Unix systems, automatically

What methods are there for automatically getting a stack trace on Unix systems? I don't mean just getting a core file or attaching interactively with GDB, but having a SIGSEGV handler that dumps a backtrace to a text file.
Bonus points for the following optional features:
Extra information gathering at crash time (eg. config files).
Email a crash info bundle to the developers.
Ability to add this in a dlopened shared library
Not requiring a GUI

FYI,
the suggested solution (using backtrace_symbols in a signal handler) is dangerously broken. DO NOT USE IT.
Yes, backtrace and backtrace_symbols will produce a backtrace and translate it to symbolic names, however:
backtrace_symbols allocates memory using malloc and you use free to free it. If you're crashing because of memory corruption, your malloc arena is very likely to be corrupt and cause a double fault.
malloc and free protect the malloc arena with a lock internally. You might have faulted in the middle of a malloc/free with the lock taken, which will cause these functions, or anything that calls them, to deadlock.
You use puts, which uses the standard stream, which is also protected by a lock. If you faulted in the middle of a printf, you once again have a deadlock.
On 32-bit platforms (e.g. your normal PC of 2 years ago), the kernel will plant a return address to an internal glibc function instead of your faulting function in your stack, so the single most important piece of information you are interested in, namely in which function the program faulted, will actually be corrupted on those platforms.
So, the code in the example is the worst kind of wrong: it LOOKS like it's working, but it will really fail you in unexpected ways in production.
BTW, interested in doing it right? Check this out.
Cheers,
Gilad.

If you are on systems with the BSD backtrace functionality available (Linux, OS X 10.5, BSD of course), you can do this programmatically in your signal handler.
For example (backtrace code derived from IBM example):
#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
void sig_handler(int sig)
{
void * array[25];
int nSize = backtrace(array, 25);
char ** symbols = backtrace_symbols(array, nSize);
for (int i = 0; i < nSize; i++)
{
puts(symbols[i]);
}
free(symbols);
signal(sig, &sig_handler);
}
void h()
{
kill(0, SIGSEGV);
}
void g()
{
h();
}
void f()
{
g();
}
int main(int argc, char ** argv)
{
signal(SIGSEGV, &sig_handler);
f();
}
Output:
0 a.out 0x00001f2d sig_handler + 35
1 libSystem.B.dylib 0x95f8f09b _sigtramp + 43
2 ??? 0xffffffff 0x0 + 4294967295
3 a.out 0x00001fb1 h + 26
4 a.out 0x00001fbe g + 11
5 a.out 0x00001fcb f + 11
6 a.out 0x00001ff5 main + 40
7 a.out 0x00001ede start + 54
This doesn't get bonus points for the optional features (except not requiring a GUI), however, it does have the advantage of being very simple, and not requiring any additional libraries or programs.

Here is an example of how to get some more info using a demangler. As you can see this one also logs the stacktrace to file.
#include <cstdlib>
#include <execinfo.h>
#include <signal.h>
#include <iostream>
#include <sstream>
#include <string>
#include <fstream>
#include <cxxabi.h>
void sig_handler(int sig)
{
std::stringstream stream;
void * array[25];
int nSize = backtrace(array, 25);
char ** symbols = backtrace_symbols(array, nSize);
for (int i = 0; i < nSize; i++) {
int status;
char *realname;
std::string current = symbols[i];
size_t start = current.find("(");
size_t end = current.find("+");
realname = NULL;
if (start != std::string::npos && end != std::string::npos) {
std::string symbol = current.substr(start+1, end-start-1);
realname = abi::__cxa_demangle(symbol.c_str(), 0, 0, &status);
}
if (realname != NULL)
stream << realname << std::endl;
else
stream << symbols[i] << std::endl;
free(realname);
}
free(symbols);
std::cerr << stream.str();
std::ofstream file("/tmp/error.log");
if (file.is_open()) {
if (file.good())
file << stream.str();
file.close();
}
signal(sig, &sig_handler);
}

Derek's solution is probably the best, but here's an alternative anyway:
Recent Linux kernel versions allow you to pipe core dumps to a script or program. You could write a script to catch the core dump, collect any extra information you need and mail everything back.
This is a global setting though, so it would apply to any crashing program on the system. It also requires root rights to set up.
It can be configured through the /proc/sys/kernel/core_pattern file. Set that to something like '|/home/myuser/bin/my-core-handler-script' (the pipe must be the first character).
The Ubuntu people use this feature as well.
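When core_pattern starts with a pipe, the kernel feeds the core dump to the handler on its standard input. The handler does not have to be a script; a minimal sketch of the copying part in C++ could look like this. The function name SaveCore and the output path are made up for illustration, and collecting extra information or mailing the result is left out.

```cpp
#include <fcntl.h>
#include <unistd.h>

// Copies everything readable from in_fd into a new file at out_path.
// Returns the number of bytes copied, or -1 on error.
long SaveCore(int in_fd, const char *out_path) {
  const int out_fd = open(out_path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
  if (out_fd < 0) return -1;
  char buf[65536];
  long total = 0;
  for (;;) {
    const ssize_t n = read(in_fd, buf, sizeof(buf));
    if (n == 0) break;  // end of the core dump
    if (n < 0) { close(out_fd); return -1; }
    ssize_t off = 0;
    while (off < n) {   // write() may be partial, so loop until done
      const ssize_t w = write(out_fd, buf + off, n - off);
      if (w < 0) { close(out_fd); return -1; }
      off += w;
    }
    total += n;
  }
  close(out_fd);
  return total;
}
```

A handler program would call something like SaveCore(STDIN_FILENO, "/var/crash/core.saved") from main, where that path is just a placeholder.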
