I wanted to change a kernel option, namely the kernel timer frequency.
I found a post saying that I can change the configuration via /boot/config-$(uname -r).
(I also found a post saying that I couldn't change the timer frequency unless the kernel is built tickless, i.e. CONFIG_NO_HZ=y, but mine is set to CONFIG_NO_HZ=y.)
It also explains how to estimate the frequency with C code.
So first I checked the current kernel timer frequency with the C code.
The result was 1000-1500 Hz.
Then I checked /boot/config-$(uname -r); it looks like this:
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
But there the timer frequency is 250 Hz...?
So, to investigate further, I tried modifying the file to:
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
Then I rebooted, checked the config file again to confirm the change was still there, and reran the C code that approximately measures the timer frequency.
But the result was the same as before.
What is the problem?
My environment is Ubuntu 12.04 running in VMware.
Below is the C code.
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define USECREQ 250
#define LOOPS 1000
void event_handler(int signum)
{
    static unsigned long cnt = 0;
    static struct timeval tsFirst;
    if (cnt == 0) {
        gettimeofday(&tsFirst, 0);
    }
    cnt++;
    if (cnt >= LOOPS) {
        struct timeval tsNow;
        struct timeval diff;
        setitimer(ITIMER_REAL, NULL, NULL);
        gettimeofday(&tsNow, 0);
        timersub(&tsNow, &tsFirst, &diff);
        unsigned long long udiff = (diff.tv_sec * 1000000) + diff.tv_usec;
        double delta = (double)(udiff / cnt) / 1000000;
        int hz = (unsigned)(1.0 / delta);
        printf("kernel timer interrupt frequency is approx. %d Hz", hz);
        if (hz >= (int)(1.0 / ((double)(USECREQ) / 1000000))) {
            printf(" or higher");
        }
        printf("\n");
        exit(0);
    }
}

int main(int argc, char **argv)
{
    struct sigaction sa;
    struct itimerval timer;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = &event_handler;
    sigaction(SIGALRM, &sa, NULL);
    timer.it_value.tv_sec = 0;
    timer.it_value.tv_usec = USECREQ;
    timer.it_interval.tv_sec = 0;
    timer.it_interval.tv_usec = USECREQ;
    setitimer(ITIMER_REAL, &timer, NULL);
    while (1);
}
Changes you make to /boot/config do not affect the running kernel. Please read more about the kernel config file here.
The config file you see in /boot (actually, it's named config-[kernel_version]) is the config file that was USED to build that kernel. This means that any change you make to it has no effect on the running system.
To really make these changes you need to construct a new config file with the modifications you require, then compile and install a new kernel based on that config file. You can take the config file from /boot as a starting point and just make the clock adjustments you need.
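If you just want to confirm what the shipped kernel was built with, without modifying anything, you can parse the CONFIG_HZ line out of the config file programmatically. A minimal sketch (the helper names and the example path are my own, not part of any standard API):

```c
#include <stdio.h>

/* Parse one line of a kernel config file.
 * Returns the HZ value if the line has the form "CONFIG_HZ=<n>",
 * or -1 otherwise (including the "# CONFIG_... is not set" form). */
int parse_config_hz(const char *line)
{
    int hz;
    if (sscanf(line, "CONFIG_HZ=%d", &hz) == 1)
        return hz;
    return -1;
}

/* Scan a config file (e.g. /boot/config-3.2.0-23-generic) for CONFIG_HZ.
 * Returns the value found, or -1 if the file can't be read or has no match. */
int read_config_hz(const char *path)
{
    char line[256];
    int hz = -1;
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        int v = parse_config_hz(line);
        if (v > 0) {
            hz = v;
            break;
        }
    }
    fclose(f);
    return hz;
}
```

For the config shown in the question, read_config_hz() on that file would return 250, the compiled-in tick rate, regardless of any edits made to the file after the kernel was built.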
I've been experimenting with a random write workload where I use multiple threads to write to disjoint offsets within one or more files on an NVMe SSD. I'm using a Linux machine and the writes are synchronous and are made using direct I/O (i.e., the files are opened with O_DSYNC and O_DIRECT).
I noticed that if the threads write concurrently to a single file, the achieved write throughput does not increase when the number of threads increases (i.e., the writes appear to be applied serially and not in parallel). However, if each thread writes to its own file, I do get throughput increases (up to the SSD's manufacturer-advertised random write throughput). See the graph below for my throughput measurements.
I was wondering if anyone knows why I'm not able to get throughput increases if I have multiple threads concurrently writing to non-overlapping regions in the same file?
Here are some additional details about my experimental setup.
I'm writing 2 GiB of data (random write) and varying the number of threads used to do the write (from 1 to 16). Each thread writes 4 KiB of data at a time. I'm considering two setups: (1) all threads write to a single file, and (2) each thread writes to its own file. Before starting the benchmark, the file(s) used are opened and are initialized to their final size using fallocate(). The file(s) are opened with O_DIRECT and O_DSYNC. Each thread is assigned a random disjoint subset of the offsets within the file (i.e., the regions the threads write to are non-overlapping). Then, the threads concurrently write to these offsets using pwrite().
Here are the machine's specifications:
Linux 5.9.1-arch1-1
1 TB Intel NVMe SSD (model SSDPE2KX010T8)
ext4 file system
128 GiB of memory
2.10 GHz 20-core Xeon Gold 6230 CPU
The SSD is supposed to be capable of delivering up to 70000 IOPS of random writes.
I've included a standalone C++ program that I've used to reproduce this behavior on my machine. I've been compiling using g++ -O3 -lpthread <file> (I'm using g++ version 10.2.0).
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <random>
#include <thread>
#include <vector>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
constexpr size_t kBlockSize = 4 * 1024;
constexpr size_t kDataSizeMiB = 2048;
constexpr size_t kDataSize = kDataSizeMiB * 1024 * 1024;
constexpr size_t kBlocksTotal = kDataSize / kBlockSize;
constexpr size_t kRngSeed = 42;
void AllocFiles(unsigned num_files, size_t blocks_per_file,
                std::vector<int> &fds,
                std::vector<std::vector<size_t>> &write_pos) {
  std::mt19937 rng(kRngSeed);
  for (unsigned i = 0; i < num_files; ++i) {
    const std::string path = "f" + std::to_string(i);
    fds.push_back(open(path.c_str(), O_CREAT | O_WRONLY | O_DIRECT | O_DSYNC,
                       S_IRUSR | S_IWUSR));
    write_pos.emplace_back();
    auto &file_offsets = write_pos.back();
    int fd = fds.back();
    for (size_t blk = 0; blk < blocks_per_file; ++blk) {
      file_offsets.push_back(blk * kBlockSize);
    }
    fallocate(fd, /*mode=*/0, /*offset=*/0, blocks_per_file * kBlockSize);
    std::shuffle(file_offsets.begin(), file_offsets.end(), rng);
  }
}

void ThreadMain(int fd, const void *data, const std::vector<size_t> &write_pos,
                size_t offset, size_t num_writes) {
  for (size_t i = 0; i < num_writes; ++i) {
    pwrite(fd, data, kBlockSize, write_pos[i + offset]);
  }
}

int main(int argc, char *argv[]) {
  assert(argc == 3);
  unsigned num_threads = strtoul(argv[1], nullptr, 10);
  unsigned files = strtoul(argv[2], nullptr, 10);
  assert(num_threads % files == 0);
  assert(num_threads >= files);
  assert(kBlocksTotal % num_threads == 0);
  void *data_buf;
  posix_memalign(&data_buf, 512, kBlockSize);
  *reinterpret_cast<uint64_t *>(data_buf) = 0xFFFFFFFFFFFFFFFF;
  std::vector<int> fds;
  std::vector<std::vector<size_t>> write_pos;
  std::vector<std::thread> threads;
  const size_t blocks_per_file = kBlocksTotal / files;
  const unsigned threads_per_file = num_threads / files;
  const unsigned writes_per_thread_per_file =
      blocks_per_file / threads_per_file;
  AllocFiles(files, blocks_per_file, fds, write_pos);
  const auto begin = std::chrono::steady_clock::now();
  for (unsigned thread_id = 0; thread_id < num_threads; ++thread_id) {
    unsigned thread_file_offset = thread_id / files;
    threads.emplace_back(
        &ThreadMain, fds[thread_id % files], data_buf,
        write_pos[thread_id % files],
        /*offset=*/(thread_file_offset * writes_per_thread_per_file),
        /*num_writes=*/writes_per_thread_per_file);
  }
  for (auto &thread : threads) {
    thread.join();
  }
  const auto end = std::chrono::steady_clock::now();
  for (const auto &fd : fds) {
    close(fd);
  }
  std::cout << kDataSizeMiB /
                   std::chrono::duration_cast<std::chrono::duration<double>>(
                       end - begin)
                       .count()
            << std::endl;
  free(data_buf);
  return 0;
}
In this scenario, the underlying reason was that ext4 was taking an exclusive lock when writing to the file. To get the multithreaded throughput scaling that we would expect when writing to the same file, I needed to make two changes:
The file needs to be "preallocated". This means making at least one actual write to every block in the file that we plan on writing to (e.g., writing zeros to the whole file); fallocate() by itself leaves the blocks marked unwritten, which is not enough.
The buffer used for making the writes needs to be aligned to the file system's block size. In my case the buffer should have been aligned to 4096.
// What I had
posix_memalign(&data_buf, 512, kBlockSize);
// What I actually needed
posix_memalign(&data_buf, 4096, kBlockSize);
With these changes, using multiple threads to make non-overlapping random writes to a single file leads to the same throughput gains as if the threads each wrote to their own file.
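The first change, writing real data over every block before the benchmark, can be sketched as below. This is my own illustration (the function name and the 1 MiB chunk size are arbitrary choices); in the actual benchmark the file would additionally be opened with O_DIRECT | O_DSYNC, which is omitted here for brevity:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write zeros over the first `size` bytes of `path`, so that every block
 * is actually allocated and initialized on disk (unlike fallocate(), which
 * leaves the blocks marked unwritten). Returns 0 on success, -1 on failure. */
int preallocate_with_writes(const char *path, size_t size)
{
    const size_t chunk = 1 << 20; /* 1 MiB per write; arbitrary */
    void *buf;
    int fd = open(path, O_CREAT | O_WRONLY, S_IRUSR | S_IWUSR);
    if (fd < 0)
        return -1;
    /* Align to 4096 so the same buffer would also satisfy O_DIRECT. */
    if (posix_memalign(&buf, 4096, chunk) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 0, chunk);
    for (size_t off = 0; off < size; off += chunk) {
        size_t n = size - off < chunk ? size - off : chunk;
        if (pwrite(fd, buf, n, (off_t)off) != (ssize_t)n) {
            free(buf);
            close(fd);
            return -1;
        }
    }
    fsync(fd);
    free(buf);
    close(fd);
    return 0;
}
```

Calling this once per file before starting the timed threads replaces the fallocate() call in AllocFiles() for the purposes of the benchmark.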
Basically, I'm using Linux 2.6.34 on PowerPC (Freescale e500mc). I have a process (a kind of VM that was developed in-house) that uses about 2.25 GB of mlocked memory. When I kill it, I notice that it takes upwards of 2 minutes to terminate.
I investigated a little. First, I closed all open file descriptors, but that didn't seem to make a difference. Then I added some printk calls in the kernel, and through them I found that all the delay comes from the kernel unlocking my VMAs. The delay is uniform across pages, which I verified by repeatedly checking the locked page count in /proc/meminfo. I've checked with programs that allocate the same amount of memory, and they all die as soon as I signal them.
What do you think I should check now? Thanks for your replies.
Edit: I had to find a way to share more information about the problem, so I wrote the program below:
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <sys/time.h>

#define MAP_PERM_1 (PROT_WRITE | PROT_READ | PROT_EXEC)
#define MAP_PERM_2 (PROT_WRITE | PROT_READ)
#define MAP_FLAGS (MAP_ANONYMOUS | MAP_FIXED | MAP_PRIVATE)
#define PG_LEN 4096
#define align_pg_32(addr) (addr & 0xFFFFF000)
#define num_pg_in_range(start, end) ((end - start + 1) >> 12)

static inline void __force_pgtbl_alloc(unsigned int start)
{
    volatile int *s = (int *) start;
    *s = *s;
}

int __map_a_page_at(unsigned int start, int whichperm)
{
    int perm = whichperm ? MAP_PERM_1 : MAP_PERM_2;
    if (MAP_FAILED == mmap((void *)start, PG_LEN, perm, MAP_FLAGS, -1, 0)) {
        fprintf(stderr,
                "mmap failed at 0x%x: %s.\n",
                start, strerror(errno));
        return 0;
    }
    return 1;
}

int __mlock_page(unsigned int addr)
{
    if (mlock((void *)addr, (size_t)PG_LEN) < 0) {
        fprintf(stderr,
                "mlock failed on page: 0x%x: %s.\n",
                addr, strerror(errno));
        return 0;
    }
    return 1;
}

void sigint_handler(int p)
{
    struct timeval start = {0, 0}, end = {0, 0}, diff = {0, 0};
    gettimeofday(&start, NULL);
    munlockall();
    gettimeofday(&end, NULL);
    timersub(&end, &start, &diff);
    printf("Munlock'd entire VM in %ld secs %ld usecs.\n",
           (long)diff.tv_sec, (long)diff.tv_usec);
    exit(0);
}

int make_vma_map(unsigned int start, unsigned int end)
{
    int num_pg = num_pg_in_range(start, end);
    if (end < start) {
        fprintf(stderr,
                "Bad range: start: 0x%x end: 0x%x.\n",
                start, end);
        return 0;
    }
    for (; num_pg; num_pg--, start += PG_LEN) {
        if (__map_a_page_at(start, num_pg % 2) && __mlock_page(start))
            __force_pgtbl_alloc(start);
        else
            return 0;
    }
    return 1;
}

void display_banner()
{
    printf("-----------------------------------------\n");
    printf("Virtual memory allocator. Ctrl+C to exit.\n");
    printf("-----------------------------------------\n");
}

int main()
{
    unsigned int vma_start, vma_end, input = 0;
    int start_end = 0; /* 0: start; 1: end */
    display_banner();
    /* Bind SIGINT handler. */
    signal(SIGINT, sigint_handler);
    while (1) {
        if (!start_end)
            printf("start:\t");
        else
            printf("end:\t");
        scanf("%i", &input);
        if (start_end) {
            vma_end = align_pg_32(input);
            make_vma_map(vma_start, vma_end);
        } else {
            vma_start = align_pg_32(input);
        }
        start_end = !start_end;
    }
    return 0;
}
As you can see, the program accepts ranges of virtual addresses, each range being defined by a start and an end. Each range is then further subdivided into page-sized VMAs by giving different permissions to adjacent pages. Interrupting the program (with SIGINT) triggers a call to munlockall(), and the time taken by that call is duly noted.
Now, when I run it on the Freescale e500mc with Linux 2.6.34 over the range 0x30000000-0x35000000, I get a total munlockall() time of almost 45 seconds. However, if I do the same thing with smaller start-end ranges in random order (that is, not necessarily at increasing addresses), such that the total number of pages (and locked VMAs) is roughly the same, I observe a total munlockall() time of no more than 4 seconds.
I tried the same thing on x86_64 with Linux 2.6.34, with my program compiled with the -m32 flag, and the variations, though not as pronounced as on ppc, are still about 8 seconds for the first case and under a second for the second.
I also tried the program on Linux 2.6.10 at one end and 3.19 at the other, and these monumental differences don't exist there. What's more, munlockall() always completes in under a second.
So it seems that the problem, whatever it is, exists only around the 2.6.34 version of the Linux kernel.
You said the VM was developed in-house. Does this mean you have access to the source? I would start by checking whether it does anything to delay its own termination, e.g. to avoid data loss.
Otherwise, could you provide more information? You may also want to check out https://unix.stackexchange.com/, as they would be better suited to help with any issues the Linux kernel may be having.
I am using the gpio-keys device driver to handle some buttons in an embedded device running Linux. Applications in user space can just open /dev/input/eventX and read input events in a loop.
My question is how to get the initial states of the buttons. There is an ioctl call (EVIOCGKEY) which can be used for this, however if I first check this and then start to read from /dev/input/eventX, there's no way to guarantee that the state did not change in between.
Any suggestions?
The evdev devices queue events until you read() them, so in most cases opening the device, doing the ioctl(), and immediately starting to read events from it should work. If the driver drops events from the queue, it sends you a SYN_DROPPED event, so you can detect when that has happened. The libevdev documentation has some ideas on how to handle that situation; the way I read it, you should simply retry: drop all pending events and redo the ioctl() until there are no more SYN_DROPPED events.
I used this code to verify that this approach works:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/input.h>
#include <string.h>

#define EVDEV "/dev/input/event9"

int main(int argc, char **argv) {
    unsigned char key_states[KEY_MAX/8 + 1];
    struct input_event evt;
    int fd;
    memset(key_states, 0, sizeof(key_states));
    fd = open(EVDEV, O_RDWR);
    ioctl(fd, EVIOCGKEY(sizeof(key_states)), key_states);
    // Create some inconsistency
    printf("Type (lots) now to make evdev drop events from the queue\n");
    sleep(5);
    printf("\n");
    while (read(fd, &evt, sizeof(struct input_event)) > 0) {
        if (evt.type == EV_SYN && evt.code == SYN_DROPPED) {
            printf("Received SYN_DROPPED. Restart.\n");
            fsync(fd);
            ioctl(fd, EVIOCGKEY(sizeof(key_states)), key_states);
        }
        else if (evt.type == EV_KEY) {
            // Ignore repetitions
            if (evt.value > 1) continue;
            key_states[evt.code / 8] ^= 1 << (evt.code % 8);
            if (((key_states[evt.code / 8] >> (evt.code % 8)) & 1) != evt.value) {
                printf("Inconsistency detected: Keycode %d is reported as %d, but %d is stored\n",
                       evt.code, evt.value,
                       (key_states[evt.code / 8] >> (evt.code % 8)) & 1);
            }
        }
    }
}
After starting, the program deliberately waits 5 seconds. Hit some keys in that time to fill the queue. On my system, I need to enter about 70 characters to trigger a SYN_DROPPED. The EV_KEY handling code then checks whether the events are consistent with the state reported by the EVIOCGKEY ioctl.
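For reference, the key_states array filled in by EVIOCGKEY is a plain bitmask with one bit per keycode. A pair of small helpers (my own naming, not from the kernel headers) keeps the index arithmetic in one place and avoids the easy-to-make operator-precedence mistakes when testing bits:

```c
/* key_states is a bitmask with one bit per keycode, as filled in by
 * ioctl(fd, EVIOCGKEY(sizeof key_states), key_states). */
static int key_is_down(const unsigned char *key_states, int code)
{
    return (key_states[code / 8] >> (code % 8)) & 1;
}

static void key_set_state(unsigned char *key_states, int code, int down)
{
    if (down)
        key_states[code / 8] |= (unsigned char)(1 << (code % 8));
    else
        key_states[code / 8] &= (unsigned char)~(1 << (code % 8));
}
```

With these, the EV_KEY branch above reduces to comparing key_is_down(key_states, evt.code) against evt.value and calling key_set_state() to update the shadow state.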
I am trying to write a program that will constantly keep track of the changes in a file and do several actions accordingly. I am using inotify and select within a loop to track file modifications in a non-blocking manner. The basic structure of the file tracking portion of my program is as follows.
#include <cstdio>
#include <signal.h>
#include <limits.h>
#include <sys/inotify.h>
#include <sys/select.h>
#include <fcntl.h>
#include <iostream>
#include <fstream>
#include <string>

int main(int argc, char **argv)
{
    const char *filename = "input.txt";
    int inotfd = inotify_init();
    char buffer[1];
    int watch_desc = inotify_add_watch(inotfd, filename, IN_MODIFY);
    size_t bufsiz = sizeof(struct inotify_event) + 1;
    struct inotify_event *event = (struct inotify_event *) &buffer[0];
    fd_set rfds;
    FD_ZERO(&rfds);
    struct timeval timeout;
    while (1)
    {
        /* select() initialisation */
        FD_SET(inotfd, &rfds); // inotify descriptor to be listened to
        timeout.tv_sec = 10;
        timeout.tv_usec = 0;
        int res = select(FD_SETSIZE, &rfds, NULL, NULL, &timeout);
        FD_ZERO(&rfds);
        printf("File Changed\n");
    }
}
I checked the select() manual page and I reset the fd_set each time select() returns. However, whenever I modify the file (input.txt), this code just loops infinitely. I am not very experienced with inotify and select, so I am not sure whether the problem is with the way I use inotify or with select. I would appreciate any hints and recommendations.
You have to read the contents of the buffer after select() returns. If select() finds data available on the descriptor, it returns. So perform a read() on that file descriptor (inotfd): the read() call consumes the pending event data and returns the number of bytes read. The buffer is then empty, and on the next iteration select() blocks until new data is available.
while (1)
{
    // ...
    char pBuf[1024];
    res = select(FD_SETSIZE, &rfds, NULL, NULL, &timeout);
    if (res > 0)
        read(inotfd, pBuf, sizeof(pBuf)); // drain the queued events
    // ...
}
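A complete version of that read step could look like this. The buffer must be large enough for at least one struct inotify_event plus a file name, and a single read() may return several packed events, so this sketch (the function name is mine) walks the buffer event by event:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

/* Drain all events currently queued on an inotify descriptor.
 * Returns the number of events consumed, or -1 on read error. */
int drain_inotify(int inotfd)
{
    /* Room for several events; each occupies
     * sizeof(struct inotify_event) + ev->len bytes. */
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    ssize_t len = read(inotfd, buf, sizeof(buf));
    if (len <= 0)
        return -1;
    int count = 0;
    for (char *p = buf; p < buf + len; ) {
        struct inotify_event *ev = (struct inotify_event *)p;
        if (ev->mask & IN_MODIFY)
            printf("File changed (wd=%d)\n", ev->wd);
        p += sizeof(struct inotify_event) + ev->len;
        count++;
    }
    return count;
}
```

Calling drain_inotify(inotfd) in place of the bare read() keeps the loop correct even when multiple modifications arrive between two select() calls.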
Here is a test program that works with setitimer on Linux (kernel 2.6; HZ=100). It sets various itimers to send a signal every 10 ms (the interval is actually set to 9 ms, but the timeslice is 10 ms). The program then runs for some fixed time (e.g. 30 seconds) and counts the signals.
Is it guaranteed that signal count will be proportional to running time? Will count be the same in every run and with every timer type (-r -p -v)?
Note that there should be no other CPU-active processes on the system, and the question is about a fixed-HZ kernel.
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>

/* Use a 9 ms timer */
#define usecs 9000

volatile sig_atomic_t events = 0;

void count(int a) {
    events++;
}

int main(int argc, char **argv)
{
    int timer, j, i, k = 0;
    struct itimerval timerval = {
        .it_interval = {.tv_sec = 0, .tv_usec = usecs},
        .it_value = {.tv_sec = 0, .tv_usec = usecs}
    };
    if ((argc != 2) || (argv[1][0] != '-')) {
        printf("Usage: %s -[rpv]\n -r - ITIMER_REAL\n -p - ITIMER_PROF\n -v - ITIMER_VIRTUAL\n", argv[0]);
        exit(0);
    }
    switch (argv[1][1]) {
    case 'r':
        timer = ITIMER_REAL;
        break;
    case 'p':
        timer = ITIMER_PROF;
        break;
    case 'v':
        timer = ITIMER_VIRTUAL;
        break;
    default:
        printf("Unknown option.\n");
        exit(1);
    }
    signal(SIGALRM, count);
    signal(SIGPROF, count);
    signal(SIGVTALRM, count);
    setitimer(timer, &timerval, NULL);
    /* constants should be tuned to some huge value */
    for (j = 0; j < 4; j++)
        for (i = 0; i < 2000000000; i++)
            k += k*argc + 5*k + argc*3;
    printf("%d events\n", events);
    return 0;
}
Is it guaranteed that signal count will be proportional to running time?
Yes. In general, for all three timers, the longer the code runs, the more signals are received.
Will count be the same in every run and with every timer type (-r -p -v)?
No.
When the timer is set using ITIMER_REAL, the timer decrements in real time.
When it is set using ITIMER_VIRTUAL, the timer decrements only when the process is executing in the user address space. So, it doesn't decrement when the process makes a system call or during interrupt service routines.
So we can expect that #real_signals > #virtual_signals
ITIMER_PROF timers decrement both during user space execution of the process and when the OS is executing on behalf of the process i.e. during system calls.
So #prof_signals > #virtual_signals
An ITIMER_PROF timer doesn't decrement when the CPU is idle or running other processes, while ITIMER_REAL decrements in real time regardless. So #real_signals > #prof_signals.
To summarise: #real_signals > #prof_signals > #virtual_signals.