I have a homework assignment where we need to add some features to the Linux kernel, and we're working on Red Hat 2.4.18.
I looked in sched.c, at the function set_user_nice:
void set_user_nice(task_t *p, long nice)
{
	unsigned long flags;
	prio_array_t *array;
	runqueue_t *rq;

	if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
		return;
	/*
	 * We have to be careful, if called from sys_setpriority(),
	 * the task might be in the middle of scheduling on another CPU.
	 */
	rq = task_rq_lock(p, &flags);
	if (rt_task(p)) {
		p->static_prio = NICE_TO_PRIO(nice);
		goto out_unlock;
	}
	array = p->array;
	if (array)
		dequeue_task(p, array);
	p->static_prio = NICE_TO_PRIO(nice);
	p->prio = NICE_TO_PRIO(nice);
	if (array) {
		enqueue_task(p, array);
		/*
		 * If the task is running and lowered its priority,
		 * or increased its priority then reschedule its CPU:
		 */
		if ((NICE_TO_PRIO(nice) < p->static_prio) || (p == rq->curr))
			resched_task(rq->curr);
	}
out_unlock:
	task_rq_unlock(rq, &flags);
}
I don't understand what exactly the code checks in the last if statement,
because a few lines above it, we have this line:
p->static_prio = NICE_TO_PRIO(nice);
and then, in the if statement we check:
(NICE_TO_PRIO(nice) < p->static_prio)
Am I missing something?
Thanks
OK, so I looked for this function in a newer version of the source code, and it is implemented in kernel/sched/core.c.
The part I was talking about:
	old_prio = p->prio;
	p->prio = effective_prio(p);
	delta = p->prio - old_prio;

	if (queued) {
		enqueue_task(rq, p, ENQUEUE_RESTORE);
		/*
		 * If the task increased its priority or is running and
		 * lowered its priority, then reschedule its CPU:
		 */
		if (delta < 0 || (delta > 0 && task_running(rq, p)))
			resched_curr(rq);
	}
out_unlock:
So it does seem like the diff between the old and the new priority is now calculated properly.
Is 2.4.18 the kernel version? I'm looking at that source and don't see set_user_nice in sched.c.
Anyway, I think they're handling a race condition there. It's possible that between the time they set the new process priority, the process itself has changed it. So they're checking whether that's the case and rescheduling the task if so.
Related: This question originates from here: Overlapping communication and computation taking 2.1 times as much time.
I've implemented Cannon's algorithm, which performs distributed-memory tensor-matrix multiplication. During this, I thought it would be clever to hide communication latencies by overlapping computation and communication.
Now I started micro-benchmarking the components, i.e. the communication, the computation, and the overlapped communication and computation, and something funny has come out of it: the overlapped operation takes 2.1 times as long as the longer of the two individual operations. Sending a single message alone took 521639 us, computing on the (same-sized) data alone took 340435 us, but overlapping them took 1111500 us.
After numerous test-runs involving independent data buffers and valuable inputs in the form of comments, I have come to the conclusion that the problem is being caused by MPI's weird concept of progress.
The following is the desired behaviour:
only the thread identified by COMM_THREAD handles the communication and
all the other threads perform the computation.
If the above behaviour can be forced, in the above example, I expect to see the overlapped operation take ~521639 us.
Information:
The MPI implementation is by Intel as part of OneAPI v2021.6.0.
A single compute node has 2 sockets of Intel Xeon Platinum 8168 (2x 24 = 48 cores).
SMT is not being used, i.e. each thread is pinned to a single core.
Data is being initialized before each experiment by being mapped to the corresponding memory nodes as required by the computation.
Benchmarking runs are preceded by 10 warm-up runs.
In the given example, the tensor is sized N=600, i.e. it has 600^3 data points. However, the same behaviour was observed for smaller sizes as well.
What I've tried:
Just making asynchronous calls in the overlap:
// ...
#define COMM_THREAD 0
// ...
#pragma omp parallel
{
    if (omp_get_thread_num() == COMM_THREAD)
    {
        // perform the comms.
        auto requests = std::array<MPI_Request, 4>{};
        const auto r1 = MPI_Irecv(tens_recv_buffer_->data(), 2 * tens_recv_buffer_->size(), MPI_DOUBLE, src_proc_id_tens_,
                                  2, MPI_COMM_WORLD, &requests[0]);
        const auto s1 = MPI_Isend(tens_send_buffer_->data(), 2 * tens_send_buffer_->size(), MPI_DOUBLE,
                                  target_proc_id_tens_, 2, MPI_COMM_WORLD, &requests[1]);
        const auto r2 = MPI_Irecv(mat_recv_buffer_->data(), 2 * mat_recv_buffer_->size(), MPI_DOUBLE, src_proc_id_mat_,
                                  3, MPI_COMM_WORLD, &requests[2]);
        const auto s2 = MPI_Isend(mat_send_buffer_->data(), 2 * mat_send_buffer_->size(), MPI_DOUBLE,
                                  target_proc_id_mat_, 3, MPI_COMM_WORLD, &requests[3]);

        if (MPI_SUCCESS != s1 || MPI_SUCCESS != r1 || MPI_SUCCESS != s2 || MPI_SUCCESS != r2)
        {
            throw std::runtime_error("tensor_matrix_mult_mpi_sendrecv_error");
        }

        if (MPI_SUCCESS != MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE))
        {
            throw std::runtime_error("tensor_matrix_mult_mpi_waitall_error");
        }
    }
    else
    {
        const auto work_indices = schedule_thread_work(tens_recv_buffer_->get_n1(), 1);
        shared_mem::tensor_matrix_mult(*tens_send_buffer_, *mat_send_buffer_, *result_, work_indices);
    }
}
Trying manual progression:
if (omp_get_thread_num() == COMM_THREAD)
{
    // perform the comms.
    auto requests = std::array<MPI_Request, 4>{};
    const auto r1 = MPI_Irecv(tens_recv_buffer_->data(), 2 * tens_recv_buffer_->size(), MPI_DOUBLE, src_proc_id_tens_,
                              2, MPI_COMM_WORLD, &requests[0]);
    const auto s1 = MPI_Isend(tens_send_buffer_->data(), 2 * tens_send_buffer_->size(), MPI_DOUBLE,
                              target_proc_id_tens_, 2, MPI_COMM_WORLD, &requests[1]);
    const auto r2 = MPI_Irecv(mat_recv_buffer_->data(), 2 * mat_recv_buffer_->size(), MPI_DOUBLE, src_proc_id_mat_,
                              3, MPI_COMM_WORLD, &requests[2]);
    const auto s2 = MPI_Isend(mat_send_buffer_->data(), 2 * mat_send_buffer_->size(), MPI_DOUBLE,
                              target_proc_id_mat_, 3, MPI_COMM_WORLD, &requests[3]);

    if (MPI_SUCCESS != s1 || MPI_SUCCESS != r1 || MPI_SUCCESS != s2 || MPI_SUCCESS != r2)
    {
        throw std::runtime_error("tensor_matrix_mult_mpi_sendrecv_error");
    }

    // custom wait-all to ensure COMM_THREAD makes progress happen
    auto comm_done = std::array<int, 4>{0, 0, 0, 0};
    auto all_comm_done = false;
    while (!all_comm_done)
    {
        auto open_comms = 0;
        for (auto request_index = std::size_t{}; request_index < requests.size(); ++request_index)
        {
            if (comm_done[request_index])
            {
                continue;
            }
            MPI_Test(&requests[request_index], &comm_done[request_index], MPI_STATUS_IGNORE);
            ++open_comms;
        }
        all_comm_done = open_comms == 0;
    }
}
else
{
    const auto work_indices = schedule_thread_work(tens_recv_buffer_->get_n1(), 1);
    shared_mem::tensor_matrix_mult(*tens_send_buffer_, *mat_send_buffer_, *result_, work_indices);
}
Using the environment variables mentioned here: https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/environment-variables-for-async-progress-control.html in my job-script:
export I_MPI_ASYNC_PROGRESS=1 I_MPI_ASYNC_PROGRESS_THREADS=1 I_MPI_ASYNC_PROGRESS_PIN="0"
and then running the code in variant 1.
All of the above attempts have resulted in the same undesirable behaviour.
Question: How can I force only COMM_THREAD to participate in MPI progression?
Any thoughts, suggestions, speculations and ideas will be greatly appreciated. Thanks in advance.
Note: although the buffers tens_send_buffer_ and mat_send_buffer_ are accessed concurrently during the overlap, this is read-only access.
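For completeness: the hybrid setup above assumes MPI was initialized with a threading level that allows thread 0 (the OpenMP master, i.e. COMM_THREAD) to make MPI calls inside the parallel region, i.e. at least MPI_THREAD_FUNNELED. A minimal sketch of such an initialization (the function name and error handling are illustrative, not taken from my actual code):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Request a threading level that permits MPI calls from the master thread of
 * an OpenMP parallel region, and abort if the library cannot provide it. */
void init_mpi_for_funneled_comms(int *argc, char ***argv)
{
    int provided = MPI_THREAD_SINGLE;

    MPI_Init_thread(argc, argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
    {
        fprintf(stderr, "MPI library does not provide MPI_THREAD_FUNNELED\n");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }
}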
It's clear to me that perf always records one or more events, and the sampling can be counter-based or time-based. But when the -e and -F switches are not given, what is the default behavior of perf record? The manpage for perf-record doesn't tell you what it does in this case.
The default event is cycles, as can be seen by running perf script after perf record. There, you can also see that the default sampling behavior is time-based, since the number of cycles is not constant. The default frequency is 4000 Hz, which can be seen in the source code and checked by comparing the file size or number of samples to a recording where -F 4000 was specified.
The perf wiki says that the rate is 1000 Hz, but this is not true anymore for kernels newer than 3.4.
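For illustration, here is a minimal sketch of a perf_event_open() call that roughly corresponds to what the default perf record sets up: a hardware cycles event sampled in frequency mode at 4000 Hz. The sample_type flags below are illustrative; the real tool sets more attribute fields and adjusts precise_ip.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>

/* Thin wrapper: glibc provides no perf_event_open() wrapper. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int open_default_cycles_event(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size        = sizeof(attr);
    attr.type        = PERF_TYPE_HARDWARE;        /* default event type */
    attr.config      = PERF_COUNT_HW_CPU_CYCLES;  /* "cycles" */
    attr.freq        = 1;    /* frequency mode: kernel adjusts the period */
    attr.sample_freq = 4000; /* default rate of recent perf tools */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;

    /* Profile the calling thread on any CPU. */
    return (int)perf_event_open(&attr, 0, -1, -1, 0);
}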
Default event selection in perf record is done in the user-space perf tool, which is usually distributed as part of the Linux kernel. With make perf-src-tar-gz in the kernel source directory we can build a tarball for a quick rebuild, or download such a tarball from https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf. There are also several online "LXR" cross-reference viewers for the Linux kernel source, which can be used much like grep to learn about perf internals.
The function that selects the default event list (evlist) for perf record is __perf_evlist__add_default in tools/perf/util/evlist.c:
int __perf_evlist__add_default(struct evlist *evlist, bool precise)
{
	struct evsel *evsel = perf_evsel__new_cycles(precise);

	evlist__add(evlist, evsel);
	return 0;
}
It is called from the perf record implementation in tools/perf/builtin-record.c (int cmd_record()) when zero events were parsed from the options:
rec->evlist->core.nr_entries == 0 &&
__perf_evlist__add_default(rec->evlist, !record.opts.no_samples)
And perf_evsel__new_cycles will ask for the hardware event cycles (PERF_TYPE_HARDWARE + PERF_COUNT_HW_CPU_CYCLES), with optional kernel sampling and maximum precision (see the modifiers in man perf-list; they are EIP sampling-skid workarounds using PEBS or IBS):
struct evsel *perf_evsel__new_cycles(bool precise)
{
	struct perf_event_attr attr = {
		.type		= PERF_TYPE_HARDWARE,
		.config		= PERF_COUNT_HW_CPU_CYCLES,
		.exclude_kernel	= !perf_event_can_profile_kernel(),
	};
	struct evsel *evsel;

	/*
	 * Now let the usual logic to set up the perf_event_attr defaults
	 * to kick in when we return and before perf_evsel__open() is called.
	 */
	evsel = evsel__new(&attr);

	evsel->precise_max = true;

	/* use asprintf() because free(evsel) assumes name is allocated */
	if (asprintf(&evsel->name, "cycles%s%s%.*s",
		     (attr.precise_ip || attr.exclude_kernel) ? ":" : "",
		     attr.exclude_kernel ? "u" : "",
		     attr.precise_ip ? attr.precise_ip + 1 : 0, "ppp") < 0)
		return evsel;
}
If perf_event_open fails (no access to hardware cycles sampling, for example in a virtualized environment without a virtualized PMU), there is a fallback to software cpu-clock sampling in tools/perf/builtin-record.c (int record__open()), which calls perf_evsel__fallback() in tools/perf/util/evsel.c:
bool perf_evsel__fallback(struct evsel *evsel, int err,
			  char *msg, size_t msgsize)
{
	if ((err == ENOENT || err == ENXIO || err == ENODEV) &&
	    evsel->core.attr.type == PERF_TYPE_HARDWARE &&
	    evsel->core.attr.config == PERF_COUNT_HW_CPU_CYCLES) {
		/*
		 * If it's cycles then fall back to hrtimer based
		 * cpu-clock-tick sw counter, which is always available even if
		 * no PMU support.
		 */
		scnprintf(msg, msgsize, "%s",
			  "The cycles event is not supported, trying to fall back to cpu-clock-ticks");

		evsel->core.attr.type   = PERF_TYPE_SOFTWARE;
		evsel->core.attr.config = PERF_COUNT_SW_CPU_CLOCK;
		return true;
	} ...
}
I created a CUDA function for calculating the sum of an image using its histogram.
I'm trying to compile the kernel and the wrapper function for multiple compute capabilities.
Kernel:
__global__ void calc_hist(unsigned char* pSrc, int* hist, int width, int height, int pitch)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

#if __CUDA_ARCH__ > 110   //Shared Memory For Devices Above Compute 1.1
    __shared__ int shared_hist[256];
#endif

    int global_tid = yIndex * pitch + xIndex;
    int block_tid = threadIdx.y * blockDim.x + threadIdx.x;

    if(xIndex>=width || yIndex>=height) return;

#if __CUDA_ARCH__ == 110  //Calculate Histogram In Global Memory For Compute 1.1
    atomicAdd(&hist[pSrc[global_tid]],1);               /*< Atomic Add In Global Memory */
#elif __CUDA_ARCH__ > 110 //Calculate Histogram In Shared Memory For Compute Above 1.1
    shared_hist[block_tid] = 0;                         /*< Clear Shared Memory */
    __syncthreads();

    atomicAdd(&shared_hist[pSrc[global_tid]],1);        /*< Atomic Add In Shared Memory */
    __syncthreads();

    if(shared_hist[block_tid] > 0)                      /* Only Write Non Zero Bins Into Global Memory */
        atomicAdd(&(hist[block_tid]),shared_hist[block_tid]);
#else
    return; //Do Nothing For Devices Of Compute Capability 1.0
#endif
}
Wrapper Function:
int sum_8u_c1(unsigned char* pSrc, double* sum, int width, int height, int pitch, cudaStream_t stream = NULL)
{
#if __CUDA_ARCH__ == 100
    printf("Compute Capability Not Supported\n");
    return 0;
#else
    int *hHist,*dHist;
    cudaMalloc(&dHist,256*sizeof(int));
    cudaHostAlloc(&hHist,256 * sizeof(int),cudaHostAllocDefault);

    cudaMemsetAsync(dHist,0,256 * sizeof(int),stream);

    dim3 Block(16,16);
    dim3 Grid;
    Grid.x = (width + Block.x - 1)/Block.x;
    Grid.y = (height + Block.y - 1)/Block.y;

    calc_hist<<<Grid,Block,0,stream>>>(pSrc,dHist,width,height,pitch);

    cudaMemcpyAsync(hHist,dHist,256 * sizeof(int),cudaMemcpyDeviceToHost,stream);

    cudaStreamSynchronize(stream);

    (*sum) = 0.0;
    for(int i=1; i<256; i++)
        (*sum) += (hHist[i] * i);

    printf("sum = %f\n",(*sum));

    cudaFree(dHist);
    cudaFreeHost(hHist);

    return 1;
#endif
}
Question 1:
When compiling for sm_10, the wrapper and the kernel shouldn't execute. But that is not what happens. The whole wrapper function executes. The output shows sum = 0.0.
I expected the output to be Compute Capability Not Supported, as I have added the printf statement at the start of the wrapper function.
How can I prevent the wrapper function from executing on sm_10? I don't want to add any run-time checks like if statements etc. Can it be achieved through template meta programming?
Question 2:
When compiling for greater than sm_10, the program executes correctly only if I add cudaStreamSynchronize after the kernel call. But if I do not synchronize, the output is sum = 0.0. Why is this happening? I want the function to be as asynchronous as possible with respect to the host. Is it possible to move the only loop into the kernel?
I am using GTX460M, CUDA 5.0, Visual Studio 2008 on Windows 8.
Ad. Question 1
As Robert already explained in the comments, __CUDA_ARCH__ is defined only when compiling device code. To clarify: when you invoke nvcc, the code is parsed and compiled twice - once for the CPU and once for the GPU. The existence of __CUDA_ARCH__ can be used to check which of those two passes is occurring, and then, for the device code - as you do in the kernel - it can be checked which GPU you are targeting.
However, for the host side not all is lost. While you don't have __CUDA_ARCH__, you can call the API function cudaGetDeviceProperties, which returns lots of information about your GPU. In particular, the fields major and minor, which indicate the Compute Capability, may be of interest. Note that this is done at run time, not at the preprocessing stage, so the same CPU code will work on all GPUs.
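For example, a minimal sketch of such a run-time check (device index 0 assumed, error handling kept short):

#include <cuda_runtime.h>
#include <stdio.h>

/* Returns 1 if the current device has a compute capability above 1.0. */
int device_is_supported(void)
{
    cudaDeviceProp prop;

    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return 0;

    /* major/minor hold the Compute Capability, e.g. 2 and 1 for a 2.1 device. */
    if (prop.major == 1 && prop.minor == 0)
    {
        printf("Compute Capability Not Supported\n");
        return 0;
    }
    return 1;
}

Calling such a check at the start of the wrapper gives the behaviour that the #if branch was intended to provide, but at run time.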
Ad. Question 2
Kernel calls and cudaMemcpyAsync are asynchronous. It means that if you don't call cudaStreamSynchronize (or alike), the follow-up CPU code will continue running even if your GPU hasn't finished the work. This means that the data you copy from dHist to hHist might not be there yet when you begin operating on hHist in the loop. If you want to work on the output of a kernel, you have to wait until the kernel finishes.
Note that cudaMemcpy (without Async) has an implicit synchronization inside.
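As a minimal sketch of the two options (buffer names mirror the question's wrapper; error checking omitted):

#include <cuda_runtime.h>

static void copy_hist_to_host(int *hHist, const int *dHist, cudaStream_t stream)
{
    /* Option 1: asynchronous copy followed by an explicit wait on the stream. */
    cudaMemcpyAsync(hHist, dHist, 256 * sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);   /* hHist is now safe to read on the host */

    /* Option 2 (alternative): the blocking cudaMemcpy returns only after the
     * copy has finished, so no explicit cudaStreamSynchronize is needed:
     *
     *   cudaMemcpy(hHist, dHist, 256 * sizeof(int), cudaMemcpyDeviceToHost);
     */
}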
There are a number of posts and references on how to get CPU Utilization using statistics in /proc/stat. However, most of them use only four of the 7+ CPU stats (user, nice, system, and idle), ignoring the remaining jiffie CPU counts present in Linux 2.6 (iowait, irq, softirq).
As an example, see Determining CPU utilization.
My question is this: Are the iowait/irq/softirq numbers also counted in one of the first four numbers (user/nice/system/idle)? In other words, does the total jiffie count equal the sum of the first four stats? Or, is the total jiffie count equal to the sum of all 7 stats? If the latter is true, then a CPU utilization formula should take all of the numbers into account, like this:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long double a[7], b[7], loadavg;
    FILE *fp;

    for(;;)
    {
        fp = fopen("/proc/stat", "r");
        fscanf(fp, "%*s %Lf %Lf %Lf %Lf %Lf %Lf %Lf",
               &a[0], &a[1], &a[2], &a[3], &a[4], &a[5], &a[6]);
        fclose(fp);

        sleep(1);

        fp = fopen("/proc/stat", "r");
        fscanf(fp, "%*s %Lf %Lf %Lf %Lf %Lf %Lf %Lf",
               &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &b[6]);
        fclose(fp);

        loadavg = ((b[0]+b[1]+b[2]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[4]+a[5]+a[6]))
                / ((b[0]+b[1]+b[2]+b[3]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]));

        printf("The current CPU utilization is : %Lf\n", loadavg);
    }

    return 0;
}
I think iowait/irq/softirq are not counted in any of the first 4 numbers. You can see the comment above irqtime_account_process_tick in the kernel code for more detail:
(for Linux kernel 4.1.1)
 * Tick demultiplexing follows the order
 * - pending hardirq update    <-- this is irq
 * - pending softirq update    <-- this is softirq
 * - user_time
 * - idle_time                 <-- iowait is included in here, discussed below
 * - system time
 *   - check for guest_time
 *   - else account as system_time
For the idle time handling, see the account_idle_time function:
/*
 * Account for idle time.
 * @cputime: the cpu time spent in idle wait
 */
void account_idle_time(cputime_t cputime)
{
	u64 *cpustat = kcpustat_this_cpu->cpustat;
	struct rq *rq = this_rq();

	if (atomic_read(&rq->nr_iowait) > 0)
		cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
	else
		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
}
If the CPU is idle AND there is some IO pending, the time is counted in CPUTIME_IOWAIT. Otherwise, it is counted in CPUTIME_IDLE.
To conclude, I think the jiffies in irq/softirq should be counted as "busy" for the CPU, because it was actually handling some IRQ or soft IRQ. On the other hand, the jiffies in "iowait" should be counted as "idle" for the CPU, because it was not doing anything but waiting for a pending IO to complete.
From busybox, its top magic is:
static const char fmt[] ALIGN1 = "cp%*s %llu %llu %llu %llu %llu %llu %llu %llu";
int ret;

if (!fgets(line_buf, LINE_BUF_SIZE, fp) || line_buf[0] != 'c' /* not "cpu" */)
    return 0;
ret = sscanf(line_buf, fmt,
             &p_jif->usr, &p_jif->nic, &p_jif->sys, &p_jif->idle,
             &p_jif->iowait, &p_jif->irq, &p_jif->softirq,
             &p_jif->steal);
if (ret >= 4) {
    p_jif->total = p_jif->usr + p_jif->nic + p_jif->sys + p_jif->idle
                 + p_jif->iowait + p_jif->irq + p_jif->softirq + p_jif->steal;
    /* procps 2.x does not count iowait as busy time */
    p_jif->busy = p_jif->total - p_jif->idle - p_jif->iowait;
}
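Putting this together, here is a sketch of the question's program with iowait moved to the idle side of the formula, in the same style (error handling omitted):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long double a[7], b[7], busy, total;
    FILE *fp;

    /* fields: user nice system idle iowait irq softirq */
    fp = fopen("/proc/stat", "r");
    fscanf(fp, "%*s %Lf %Lf %Lf %Lf %Lf %Lf %Lf",
           &a[0], &a[1], &a[2], &a[3], &a[4], &a[5], &a[6]);
    fclose(fp);

    sleep(1);

    fp = fopen("/proc/stat", "r");
    fscanf(fp, "%*s %Lf %Lf %Lf %Lf %Lf %Lf %Lf",
           &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &b[6]);
    fclose(fp);

    /* busy = user + nice + system + irq + softirq; idle side = idle + iowait */
    busy  = (b[0] + b[1] + b[2] + b[5] + b[6]) - (a[0] + a[1] + a[2] + a[5] + a[6]);
    total = busy + ((b[3] + b[4]) - (a[3] + a[4]));

    printf("The current CPU utilization is : %Lf\n", busy / total);
    return 0;
}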
I am working on performance benchmarking of an SDIO UART Linux/Android driver. I used current_kernel_time() at the start and end of the read/write function implementations to be analysed, then printed the time difference.
Most of the time I get a time difference of 0 (zero) nanoseconds (irrespective of the size of the data to read/write: 16-2048 bytes), which I think is logically incorrect; only very few times do I get some values, and hopefully those are correct.
How reliable is current_kernel_time()?
Why do I get 0 ns most of the time?
I am planning to profile at the kernel level to get more details. Before that, can somebody shed some light on this behavior? Has anybody observed anything like this before?
Also, any suggestions to help/correct my approach to benchmarking are also welcome!
Thank you.
EDIT:
This is the read code from Linux kernel version 2.6.32.9. I added calls to current_kernel_time() as shown below, under #ifdef/#endif:
static void sdio_uart_receive_chars(struct sdio_uart_port *port, unsigned int *status)
{
#ifdef SDIO_UART_DEBUG
	struct timespec time_spec1, time_spec2;
	time_spec1 = current_kernel_time();
#endif

	struct tty_struct *tty = port->tty;
	unsigned int ch, flag;
	int max_count = 256;

	do {
		ch = sdio_in(port, UART_RX);
		flag = TTY_NORMAL;
		port->icount.rx++;

		if (unlikely(*status & (UART_LSR_BI | UART_LSR_PE |
					UART_LSR_FE | UART_LSR_OE))) {
			/*
			 * For statistics only
			 */
			if (*status & UART_LSR_BI) {
				*status &= ~(UART_LSR_FE | UART_LSR_PE);
				port->icount.brk++;
			} else if (*status & UART_LSR_PE)
				port->icount.parity++;
			else if (*status & UART_LSR_FE)
				port->icount.frame++;
			if (*status & UART_LSR_OE)
				port->icount.overrun++;

			/*
			 * Mask off conditions which should be ignored.
			 */
			*status &= port->read_status_mask;
			if (*status & UART_LSR_BI) {
				flag = TTY_BREAK;
			} else if (*status & UART_LSR_PE)
				flag = TTY_PARITY;
			else if (*status & UART_LSR_FE)
				flag = TTY_FRAME;
		}

		if ((*status & port->ignore_status_mask & ~UART_LSR_OE) == 0)
			tty_insert_flip_char(tty, ch, flag);

		/*
		 * Overrun is special. Since it's reported immediately,
		 * it doesn't affect the current character.
		 */
		if (*status & ~port->ignore_status_mask & UART_LSR_OE)
			tty_insert_flip_char(tty, 0, TTY_OVERRUN);

		*status = sdio_in(port, UART_LSR);
	} while ((*status & UART_LSR_DR) && (max_count-- > 0));

	tty_flip_buffer_push(tty);

#ifdef SDIO_UART_DEBUG
	time_spec2 = current_kernel_time();
	printk(KERN_INFO "\n MY_DBG : read took: %ld nanoseconds",
	       (time_spec2.tv_sec - time_spec1.tv_sec) * 1000000000 + (time_spec2.tv_nsec - time_spec1.tv_nsec));
#endif
}
current_kernel_time() is meant for timekeeping, not for performance measurement.
It returns a value that is based not on an actual timer, but on a time value that is updated by a timer interrupt. So its precision depends on the timer interrupt period, and you get poor resolution.
However, getnstimeofday() is perhaps better suited to your need, since it also reads the actual clock source to adjust the time value. It should be more fine-grained.
Based on the kernel source, maybe the best function is getrawmonotonic(), which also covers the unlikely event that the system time is adjusted backward during your measurement.
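For example, a minimal sketch of timing the section with the raw monotonic clock instead of current_kernel_time() (kernel 2.6.32-era API; the debug macro is kept from your code):

#include <linux/kernel.h>
#include <linux/time.h>

#ifdef SDIO_UART_DEBUG
	struct timespec t1, t2, delta;

	getrawmonotonic(&t1);
	/* ... code under test ... */
	getrawmonotonic(&t2);

	/* timespec_sub/timespec_to_ns avoid doing the ns arithmetic by hand */
	delta = timespec_sub(t2, t1);
	printk(KERN_INFO "read took: %lld ns\n", (long long)timespec_to_ns(&delta));
#endif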