Accurately Calculating CPU Utilization in Linux using /proc/stat - linux

There are a number of posts and references on how to get CPU Utilization using statistics in /proc/stat. However, most of them use only four of the 7+ CPU stats (user, nice, system, and idle), ignoring the remaining jiffie CPU counts present in Linux 2.6 (iowait, irq, softirq).
As an example, see Determining CPU utilization.
My question is this: Are the iowait/irq/softirq numbers also counted in one of the first four numbers (user/nice/system/idle)? In other words, does the total jiffie count equal the sum of the first four stats? Or, is the total jiffie count equal to the sum of all 7 stats? If the latter is true, then a CPU utilization formula should take all of the numbers into account, like this:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
long double a[7],b[7],loadavg;
FILE *fp;
for(;;)
{
fp = fopen("/proc/stat","r");
fscanf(fp,"%*s %Lf %Lf %Lf %Lf",&a[0],&a[1],&a[2],&a[3],&a[4],&a[5],&a[6]);
fclose(fp);
sleep(1);
fp = fopen("/proc/stat","r");
fscanf(fp,"%*s %Lf %Lf %Lf %Lf",&b[0],&b[1],&b[2],&b[3],&b[4],&b[5],&b[6]);
fclose(fp);
loadavg = ((b[0]+b[1]+b[2]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[4]+a[5]+a[6]))
/ ((b[0]+b[1]+b[2]+b[3]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]));
printf("The current CPU utilization is : %Lf\n",loadavg);
}
return(0);
}

I think iowait/irq/softirq are not counted in one of the first 4 numbers. You can see the comment of irqtime_account_process_tick in kernel code for more detail:
(for Linux kernel 4.1.1)
2815 * Tick demultiplexing follows the order
2816 * - pending hardirq update <-- this is irq
2817 * - pending softirq update <-- this is softirq
2818 * - user_time
2819 * - idle_time <-- iowait is included in here, discuss below
2820 * - system time
2821 * - check for guest_time
2822 * - else account as system_time
For the idle time handling, see account_idle_time function:
2772 /*
2773 * Account for idle time.
2774 * #cputime: the cpu time spent in idle wait
2775 */
2776 void account_idle_time(cputime_t cputime)
2777 {
2778 u64 *cpustat = kcpustat_this_cpu->cpustat;
2779 struct rq *rq = this_rq();
2780
2781 if (atomic_read(&rq->nr_iowait) > 0)
2782 cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
2783 else
2784 cpustat[CPUTIME_IDLE] += (__force u64) cputime;
2785 }
If the cpu is idle AND there is some IO pending, it will count the time in CPUTIME_IOWAIT. Otherwise, it is count in CPUTIME_IDLE.
To conclude, I think the jiffies in irq/softirq should be counted as "busy" for cpu because it was actually handling some IRQ or soft IRQ. On the other hand, the jiffies in "iowait" should be counted as "idle" for cpu because it was not doing something but waiting for a pending IO to happen.

from busybox, its top magic is:
static const char fmt[] ALIGN1 = "cp%*s %llu %llu %llu %llu %llu %llu %llu %llu";
int ret;
if (!fgets(line_buf, LINE_BUF_SIZE, fp) || line_buf[0] != 'c' /* not "cpu" */)
return 0;
ret = sscanf(line_buf, fmt,
&p_jif->usr, &p_jif->nic, &p_jif->sys, &p_jif->idle,
&p_jif->iowait, &p_jif->irq, &p_jif->softirq,
&p_jif->steal);
if (ret >= 4) {
p_jif->total = p_jif->usr + p_jif->nic + p_jif->sys + p_jif->idle
+ p_jif->iowait + p_jif->irq + p_jif->softirq + p_jif->steal;
/* procps 2.x does not count iowait as busy time */
p_jif->busy = p_jif->total - p_jif->idle - p_jif->iowait;
}

Related

OpenMP worst performance with more threads (following openMP tutorials)

I'm starting to work with OpenMP and I follow these tutorials:
OpenMP Tutorials
I'm coding exactly what appears on the video, but instead of a better performance with more threads I get worse. I don't understand why.
Here's my code:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main()
{
clock_t t;
t = clock();
int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0/(double)num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id, nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if(id == 0) nthreads = nthrds;
for(i=id, sum[id]=0.0; i < num_steps; i = i + nthrds)
{
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
for(i = 0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
t = clock() - t;
cout << "time: " << t << " miliseconds" << endl;
}
As you can see, it's exactly the same as in the video, I only added a code to measure an elapsed time.
On the tutorial, the more threads we use the better a performance.
In my case, that doesn't happen. Here are the timing I got:
1 thread: 433590 miliseconds
2 threads: 1705704 miliseconds
3 threads: 2689001 miliseconds
4 threads: 4221881 miliseconds
Why do I get this behavior?
-- EDIT --
gcc version: gcc 5.5.0
result of lscpu:
Architechure: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i7-4720HQ CPU # 2.60Ghz
Stepping: 3
CPU Mhz: 2594.436
CPU max MHz: 3600,0000
CPU min Mhz: 800,0000
BogoMIPS: 5188.41
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
-- EDIT --
I've tried using omp_get_wtime() instead, like this:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;
static long num_steps = 100000000;
double step;
#define NUM_THREADS 8
int main()
{
int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0/(double)num_steps;
double start_time = omp_get_wtime();
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id, nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if(id == 0) nthreads = nthrds;
for(i=id, sum[id]=0.0; i < num_steps; i = i + nthrds)
{
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
for(i = 0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
double time = omp_get_wtime() - start_time;
cout << "time: " << time << " seconds" << endl;
}
The behavior is different, although I have some questions.
Now, if I increase the number of threads by 1, for example, 1 thread, 2 threads, 3, 4, ..., the results are basically the same as previous, the performance gets worse, although if I increase to 64 threads, or 128 threads I get indeed better performance, the timing decreases from 0.44 [s] (for 1 thread) to 0.13 [s] ( for 128 threads ).
My question is: Why I don't have the same behaviour as in the tutorial?
2 threads get better performance than 1,
3 threads get better performance than 2, etc.
Why do I only get better performance with much bigger amount of threads?
instead of better performances with more threads I get worse ... I don't understand why.
Well,let's make the testing a bit more systematic and repeatable to see if :
// time: 1535120 milliseconds 1 thread
// time: 200679 milliseconds 1 thread -O2
// time: 191205 milliseconds 1 thread -O3
// time: 184502 milliseconds 2 threads -O3
// time: 189947 milliseconds 3 threads -O3
// time: 202277 milliseconds 4 threads -O3
// time: 182628 milliseconds 5 threads -O3
// time: 192032 milliseconds 6 threads -O3
// time: 185771 milliseconds 7 threads -O3
// time: 187606 milliseconds 16 threads -O3
// time: 187231 milliseconds 32 threads -O3
// time: 186131 milliseconds 64 threads -O3
ref.: a few sample runs on a TiO.RUN platform fast mock-up ... where limited resources apply a certain glass-ceiling to hit...
This did show more the effects of { -O2 |-O3 }-compilation-mode optimisation effects, than the above proposed principal degradation for growing number of threads.
Next comes the "background" noise from non-managed code-execution ecosystem, where O/S will easily skew the simplistic performance benchmarking
If indeed interested in further details, feel free to read about a Law of diminishing returns ( about real world compositions of [SERIAL], resp. [PARALLEL] parts of the process-scheduling ), where Dr. Gene AMDAHL has initiated the principal rules,
why more threads do not get way better performance ( and where a bit more contemporary re-formulation of this law explains, why more threads may even get negative improvement ( get more expensive add-on overheads ), than a right-tuned peak performance.
#include <time.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
using namespace std;
static long num_steps = 100000000;
double step;
#define NUM_THREADS 7
int main()
{
clock_t t;
t = clock();
int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0 / ( double )num_steps;
omp_set_num_threads( NUM_THREADS );
// struct timespec start;
// t = clock(); // _________________________________________ BEST START HERE
// clock_gettime( CLOCK_MONOTONIC, &start ); // ____________ USING MONOTONIC CLOCK
#pragma omp parallel
{
int i,
nthrds = omp_get_num_threads(),
id = omp_get_thread_num();;
double x;
if ( id == 0 ) nthreads = nthrds;
for ( i = id, sum[id] = 0.0;
i < num_steps;
i += nthrds
)
{
x = ( i + 0.5 ) * step;
sum[id] += 4.0 / ( 1.0 + x * x );
}
}
// t = clock() - t; // _____________________________________ BEST STOP HERE
// clock_gettime( CLOCK_MONOTONIC, &end ); // ______________ USING MONOTONIC CLOCK
for ( i = 0, pi = 0.0;
i < nthreads;
i++
) pi += sum[i] * step;
t = clock() - t;
// // time: 1535120 milliseconds 1 thread
// // time: 200679 milliseconds 1 thread -O2
// // time: 191205 milliseconds 1 thread -O3
printf( "time: %d milliseconds %d threads\n", // time: 184502 milliseconds 2 threads -O3
t, // time: 189947 milliseconds 3 threads -O3
NUM_THREADS // time: 202277 milliseconds 4 threads -O3
); // time: 182628 milliseconds 5 threads -O3
} // time: 192032 milliseconds 6 threads -O3
// time: 185771 milliseconds 7 threads -O3
The major problem in that version is false sharing. This is explained later in the video you started to watch. You get this when many threads are accessing data that is adjacent in memory (the sum array). The video also explains how to use padding to manually avoid this issue.
That said, the idiomatic solution is to use a reduction and not even bother with the manual work sharing:
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i < num_steps; i++)
{
double x = (i+0.5)*step;
sum += 4.0/(1.0+x*x);
}
This is also explained in a later video of the series. It is much simpler than what you started with and most likely the most efficient way.
Although the presenter is certainly competent, the style of these OpenMP tutorial videos is very much bottom up. I'm not sure that is a good educational approach. In any case you should probably watch all of the videos to know how to best use OpenMP it in practice.
Why do I only get better performance with much bigger amount of threads?
This is a bit counterintuitive, you very rarely get better performance from using more OpenMP threads than hardware threads - unless this is indirectly fixing another issue. In your case the large amount of threads means that the sum array is spread out over a larger region in memory and false-sharing is less likely.

Large overhead in CUDA kernel launch outside GPU execution

I am measuring the running time of kernels, as seen from a CPU thread, by measuring the interval from before launching a kernel to after a cudaDeviceSynchronize (using gettimeofday). I have a cudaDeviceSynchronize before I start recording the interval. I also instrument the kernels to record the timestamp on the GPU (using clock64) at the start of the kernel by thread(0,0,0) of each block from block(0,0,0) to block(occupancy-1,0,0) to an array of size equal to number of SMs. Every thread at the end of the kernel code, updates the timestamp to another array (of the same size) at the index equal to the index of the SM it runs on.
The intervals calculated from the two arrays are 60-70% of that measured from the CPU thread.
For example, on a K40, while gettimeofday gives an interval of 140ms, the avg of intervals calculated from GPU timestamps is only 100ms. I have experimented with many grid sizes (15 blocks to 6K blocks) but have found similar behavior so far.
__global__ void some_kernel(long long *d_start, long long *d_end){
if(threadIdx.x==0){
d_start[blockIdx.x] = clock64();
}
//some_kernel code
d_end[blockIdx.x] = clock64();
}
Does this seem possible to the experts?
Does this seem possible to the experts?
I suppose anything is possible for code you haven't shown. After all, you may just have a silly bug in any of your computation arithmetic. But if the question is "is it sensible that there should be 40ms of unaccounted-for time overhead on a kernel launch, for a kernel that takes ~140ms to execute?" I would say no.
I believe the method I outlined in the comments is reasonably accurate. Take the minimum clock64() timestamp from any thread in the grid (but see note below regarding SM restriction). Compare it to the maximum time stamp of any thread in the grid. The difference will be comparable to the reported execution time of gettimeofday() to within 2 percent, according to my testing.
Here is my test case:
$ cat t1040.cu
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define LS_MAX 2000000000U
#define MAX_SM 64
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
__device__ int result;
__device__ unsigned long long t_start[MAX_SM];
__device__ unsigned long long t_end[MAX_SM];
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
__device__ __inline__ uint32_t __mysmid(){
uint32_t smid;
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
return smid;}
__global__ void kernel(unsigned ls){
unsigned long long int ts = clock64();
unsigned my_sm = __mysmid();
atomicMin(t_start+my_sm, ts);
// junk code to waste time
int tv = ts&0x1F;
for (unsigned i = 0; i < ls; i++){
tv &= (ts+i);}
result = tv;
// end of junk code
ts = clock64();
atomicMax(t_end+my_sm, ts);
}
// optional command line parameter 1 = kernel duration, parameter 2 = number of blocks, parameter 3 = number of threads per block
int main(int argc, char *argv[]){
unsigned ls;
if (argc > 1) ls = atoi(argv[1]);
else ls = 1000000;
if (ls > LS_MAX) ls = LS_MAX;
int num_sms = 0;
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
cudaCheckErrors("cuda get attribute fail");
int gpu_clk = 0;
cudaDeviceGetAttribute(&gpu_clk, cudaDevAttrClockRate, 0);
if ((num_sms < 1) || (num_sms > MAX_SM)) {printf("invalid sm count: %d\n", num_sms); return 1;}
unsigned blks;
if (argc > 2) blks = atoi(argv[2]);
else blks = num_sms;
if ((blks < 1) || (blks > 0x3FFFFFFF)) {printf("invalid blocks: %d\n", blks); return 1;}
unsigned ntpb;
if (argc > 3) ntpb = atoi(argv[3]);
else ntpb = 256;
if ((ntpb < 1) || (ntpb > 1024)) {printf("invalid threads: %d\n", ntpb); return 1;}
kernel<<<1,1>>>(100); // warm up
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
unsigned long long *h_start, *h_end;
h_start = new unsigned long long[num_sms];
h_end = new unsigned long long[num_sms];
for (int i = 0; i < num_sms; i++){
h_start[i] = 0xFFFFFFFFFFFFFFFFULL;
h_end[i] = 0;}
cudaMemcpyToSymbol(t_start, h_start, num_sms*sizeof(unsigned long long));
cudaMemcpyToSymbol(t_end, h_end, num_sms*sizeof(unsigned long long));
unsigned long long htime = dtime_usec(0);
kernel<<<blks,ntpb>>>(ls);
cudaDeviceSynchronize();
htime = dtime_usec(htime);
cudaMemcpyFromSymbol(h_start, t_start, num_sms*sizeof(unsigned long long));
cudaMemcpyFromSymbol(h_end, t_end, num_sms*sizeof(unsigned long long));
cudaCheckErrors("some error");
printf("host elapsed time (ms): %f \n device sm clocks:\n start:", htime/1000.0f);
unsigned long long max_diff = 0;
for (int i = 0; i < num_sms; i++) {printf(" %12lu ", h_start[i]);}
printf("\n end: ");
for (int i = 0; i < num_sms; i++) {printf(" %12lu ", h_end[i]);}
for (int i = 0; i < num_sms; i++) if ((h_start[i] != 0xFFFFFFFFFFFFFFFFULL) && (h_end[i] != 0) && ((h_end[i]-h_start[i]) > max_diff)) max_diff=(h_end[i]-h_start[i]);
printf("\n max diff clks: %lu\nmax diff kernel time (ms): %f\n", max_diff, max_diff/(float)(gpu_clk));
return 0;
}
$ nvcc -o t1040 t1040.cu -arch=sm_35
$ ./t1040 1000000 1000 128
host elapsed time (ms): 2128.818115
device sm clocks:
start: 3484744 3484724
end: 2219687393 2228431323
max diff clks: 2224946599
max diff kernel time (ms): 2128.117432
$
Notes:
This code can only be run on a cc3.5 or higher GPU due to the use of 64-bit atomicMin and atomicMax.
I've run it on a variety of grid configurations, on both a GT640 (very low end cc3.5 device) and K40c (high end) and the timing results between host and device agree to within 2% (for reasonably long kernel execution times. If you pass 1 as the command line parameter, with very small grid sizes, the kernel execution time will be very short (nanoseconds) whereas the host will see about 10-20us. This is kernel launch overhead being measured. So the 2% number is for kernels that take much longer than 20us to execute).
It accepts 3 (optional) command line parameters, the first of which varies the amount of time the kernel will execute.
My timestamping is done on a per-SM basis, because the clock64() resource is indicated to be a per-SM resource. The sm clocks are not guaranteed to be synchronized between SMs.
You can modify the grid dimensions. The second optional command line parameter specifies the number of blocks to launch. The third optional command line parameter specifies the number of threads per block. The timing methodology I have shown here should not be dependent on number of blocks launched or number of threads per block. If you specify fewer blocks than SMs, the code should ignore "unused" SM data.

How to calculate CPU utilization of a process & all its child processes in Linux?

I want to know the CPU utilization of a process and all the child processes, for a fixed period of time, in Linux.
To be more specific, here is my use-case:
There is a process which waits for a request from the user to execute the programs. To execute the programs, this process invokes child processes (maximum limit of 5 at a time) & each of this child process executes 1 of these submitted programs (let's say user submitted 15 programs at once). So, if user submits 15 programs, then 3 batches of 5 child processes each will run. Child processes are killed as soon as they finish their execution of the program.
I want to know about % CPU Utilization for the parent process and all its child process during the execution of those 15 programs.
Is there any simple way to do this using top or another command? (Or any tool i should attach to the parent process.)
You can find this information in /proc/PID/stat where PID is your parent process's process ID. Assuming that the parent process waits for its children then the total CPU usage can be calculated from utime, stime, cutime and cstime:
utime %lu
Amount of time that this process has been scheduled in user mode,
measured in clock ticks (divide by sysconf(_SC_CLK_TCK). This includes
guest time, guest_time (time spent running a virtual CPU, see below),
so that applications that are not aware of the guest time field do not
lose that time from their calculations.
stime %lu
Amount of time that this process has been scheduled in kernel mode,
measured in clock ticks (divide by sysconf(_SC_CLK_TCK).
cutime %ld
Amount of time that this process's waited-for children have been
scheduled in user mode, measured in clock ticks (divide by
sysconf(_SC_CLK_TCK). (See also times(2).) This includes guest time,
cguest_time (time spent running a virtual CPU, see below).
cstime %ld
Amount of time that this process's waited-for children have been
scheduled in kernel mode, measured in clock ticks (divide by
sysconf(_SC_CLK_TCK).
See proc(5) manpage for details.
And of course you can do it in hardcore-way using good old C
find_cpu.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define MAX_CHILDREN 100
/**
* System command execution output
* #param <char> command - system command to execute
* #returb <char> execution output
*/
char *system_output (const char *command)
{
FILE *pipe;
static char out[1000];
pipe = popen (command, "r");
fgets (out, sizeof(out), pipe);
pclose (pipe);
return out;
}
/**
* Finding all process's children
* #param <Int> - process ID
* #param <Int> - array of childs
*/
void find_children (int pid, int children[])
{
char empty_command[] = "/bin/ps h -o pid --ppid ";
char pid_string[5];
snprintf(pid_string, 5, "%d", pid);
char *command = (char*) malloc(strlen(empty_command) + strlen(pid_string) + 1);
sprintf(command, "%s%s", empty_command, pid_string);
FILE *fp = popen(command, "r");
int child_pid, i = 1;
while (fscanf(fp, "%i", &child_pid) != EOF)
{
children[i] = child_pid;
i++;
}
}
/**
* Parsign `ps` command output
* #param <char> out - ps command output
* #return <int> cpu utilization
*/
float parse_cpu_utilization (const char *out)
{
float cpu;
sscanf (out, "%f", &cpu);
return cpu;
}
int main(void)
{
unsigned pid = 1;
// getting array with process children
int process_children[MAX_CHILDREN] = { 0 };
process_children[0] = pid; // parent PID as first element
find_children(pid, process_children);
// calculating summary processor utilization
unsigned i;
float common_cpu_usage = 0.0;
for (i = 0; i < sizeof(process_children)/sizeof(int); ++i)
{
if (process_children[i] > 0)
{
char *command = (char*)malloc(1000);
sprintf (command, "/bin/ps -p %i -o 'pcpu' --no-headers", process_children[i]);
common_cpu_usage += parse_cpu_utilization(system_output(command));
}
}
printf("%f\n", common_cpu_usage);
return 0;
}
Compile:
gcc -Wall -pedantic --std=gnu99 find_cpu.c
Enjoy!
Might not be the exact command. But you can do something like below to get cpu usage of various process and add it.
#ps -C sendmail,firefox -o pcpu= | awk '{s+=$1} END {print s}'
/proc/[pid]/stat Status information about the process. This is used by ps and made into human readable form.
Another way is to use cgroups and use cpuacct.
http://www.kernel.org/doc/Documentation/cgroups/cpuacct.txt
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpuacct.html
Here's one-liner to compute total CPU for all processes. You can adjust it by passing column filter into top output:
top -b -d 5 -n 2 | awk '$1 == "PID" {block_num++; next} block_num == 2 {sum += $9;} END {print sum}'

CUDA performance test

I'm writing a simple CUDA program for performance test.
This is not related to vector calculation, but just for a simple (parallel) string conversion.
#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>
#define UCHAR unsigned char
#define UINT32 unsigned long int
#define CTX_SIZE sizeof(aes_context)
#define DOCU_SIZE 4096
#define TOTAL 100000
#define BBLOCK_SIZE 500
UCHAR pH_TXT[DOCU_SIZE * TOTAL];
UCHAR pH_ENC[DOCU_SIZE * TOTAL];
UCHAR* pD_TXT;
UCHAR* pD_ENC;
__global__
void TEST_Encode( UCHAR *a_input, UCHAR *a_output )
{
UCHAR *input;
UCHAR *output;
input = &(a_input[threadIdx.x * DOCU_SIZE]);
output = &(a_output[threadIdx.x * DOCU_SIZE]);
for ( int i = 0 ; i < 30 ; i++ ) {
if ( (input[i] >= 'a') && (input[i] <= 'z') ) {
output[i] = input[i] - 'a' + 'A';
}
else {
output[i] = input[i];
}
}
}
int main(int argc, char** argv)
{
struct cudaDeviceProp xCUDEV;
cudaGetDeviceProperties(&xCUDEV, 0);
// Prepare Source
memset(pH_TXT, 0x00, DOCU_SIZE * TOTAL);
for ( int i = 0 ; i < TOTAL ; i++ ) {
strcpy((char*)pH_TXT + (i * DOCU_SIZE), "hello world, i need an apple.");
}
// Allocate vectors in device memory
cudaMalloc((void**)&pD_TXT, DOCU_SIZE * TOTAL);
cudaMalloc((void**)&pD_ENC, DOCU_SIZE * TOTAL);
// Copy vectors from host memory to device memory
cudaMemcpy(pD_TXT, pH_TXT, DOCU_SIZE * TOTAL, cudaMemcpyHostToDevice);
// Invoke kernel
int threadsPerBlock = BLOCK_SIZE;
int blocksPerGrid = (TOTAL + threadsPerBlock - 1) / threadsPerBlock;
printf("Total Task is %d\n", TOTAL);
printf("block size is %d\n", threadsPerBlock);
printf("repeat cnt is %d\n", blocksPerGrid);
TEST_Encode<<<blocksPerGrid, threadsPerBlock>>>(pD_TXT, pD_ENC);
cudaMemcpy(pH_ENC, pD_ENC, DOCU_SIZE * TOTAL, cudaMemcpyDeviceToHost);
// Free device memory
if (pD_TXT) cudaFree(pD_TXT);
if (pD_ENC) cudaFree(pD_ENC);
cudaDeviceReset();
}
And when i change BLOCK_SIZE value from 2 to 1000, I got a following duration time (from NVIDIA Visual Profiler)
TOTAL BLOCKS BLOCK_SIZE Duration(ms)
100000 50000 2 28.22
100000 10000 10 22.223
100000 2000 50 12.3
100000 1000 100 9.624
100000 500 200 10.755
100000 250 400 29.824
100000 200 500 39.67
100000 100 1000 81.268
My GPU is GeForce GT520 and max threadsPerBlock value is 1024, so I predicted that I would get best performance when BLOCK is 1000, but the above table shows different result.
I can't understand why Duration time is not linear, and how can I fix this problem. (or how can I find optimized Block value (mimimum Duration time)
It seems 2, 10, 50 threads doesn't utilize the capabilities of the gpu since its design is to start much more threads.
Your card has compute capability 2.1.
Maximum number of resident threads per multiprocessor = 1536
Maximum number of threads per block = 1024
Maximum number of resident blocks per multiprocessor = 8
Warp size = 32
There are two issues:
1.
You try to occupy so much register memory per thread that it will definetly is outsourced to slow local memory space if your block sizes increases.
2.
Perform your tests with multiple of 32 since this is the warp size of your card and many memory operations are optimized for thread sizes with multiple of the warp size.
So if you use only around 1024 (1000 in your case) threads per block 33% of your gpu is idle since only 1 block can be assigned per SM.
What happens if you use the following 100% occupancy sizes?
128 = 12 blocks -> since only 8 can be resident per sm the block execution is serialized
192 = 8 resident blocks per sm
256 = 6 resident blocks per sm
512 = 3 resident blocks per sm

Converting jiffies to milli seconds

How do I manually convert jiffies to milliseconds and vice versa in Linux? I know kernel 2.6 has a function for this, but I'm working on 2.4 (homework) and though I looked at the code it uses lots of macro constants which I have no idea if they're defined in 2.4.
As a previous answer said, the rate at which jiffies increments is fixed.
The standard way of specifying time for a function that accepts jiffies is using the constant HZ.
That's the abbreviation for Hertz, or the number of ticks per second. On a system with a timer tick set to 1ms, HZ=1000. Some distributions or architectures may use another number (100 used to be common).
The standard way of specifying a jiffies count for a function is using HZ, like this:
schedule_timeout(HZ / 10); /* Timeout after 1/10 second */
In most simple cases, this works fine.
2*HZ /* 2 seconds in jiffies */
HZ /* 1 second in jiffies */
foo * HZ /* foo seconds in jiffies */
HZ/10 /* 100 milliseconds in jiffies */
HZ/100 /* 10 milliseconds in jiffies */
bar*HZ/1000 /* bar milliseconds in jiffies */
Those last two have a bit of a problem, however, as on a system with a 10 ms timer tick, HZ/100 is 1, and the precision starts to suffer. You may get a delay anywhere between 0.0001 and 1.999 timer ticks (0-2 ms, essentially). If you tried to use HZ/200 on a 10ms tick system, the integer division gives you 0 jiffies!
So the rule of thumb is, be very careful using HZ for tiny values (those approaching 1 jiffie).
To convert the other way, you would use:
jiffies / HZ /* jiffies to seconds */
jiffies * 1000 / HZ /* jiffies to milliseconds */
You shouldn't expect anything better than millisecond precision.
Jiffies are hard-coded in Linux 2.4. Check the definition of HZ, which is defined in the architecture-specific param.h. It's often 100 Hz, which is one tick every (1 sec/100 ticks * 1000 ms/sec) 10 ms.
This holds true for i386, and HZ is defined in include/asm-i386/param.h.
There are functions in include/linux/time.h called timespec_to_jiffies and jiffies_to_timespec where you can convert back and forth between a struct timespec and jiffies:
#define MAX_JIFFY_OFFSET ((~0UL >> 1)-1)
static __inline__ unsigned long
timespec_to_jiffies(struct timespec *value)
{
unsigned long sec = value->tv_sec;
long nsec = value->tv_nsec;
if (sec >= (MAX_JIFFY_OFFSET / HZ))
return MAX_JIFFY_OFFSET;
nsec += 1000000000L / HZ - 1;
nsec /= 1000000000L / HZ;
return HZ * sec + nsec;
}
static __inline__ void
jiffies_to_timespec(unsigned long jiffies, struct timespec *value)
{
value->tv_nsec = (jiffies % HZ) * (1000000000L / HZ);
value->tv_sec = jiffies / HZ;
}
Note: I checked this info in version 2.4.22.
I found this sample code on kernelnewbies. Make sure you link with -lrt
#include <unistd.h>
#include <time.h>
#include <stdio.h>
int main()
{
struct timespec res;
double resolution;
printf("UserHZ %ld\n", sysconf(_SC_CLK_TCK));
clock_getres(CLOCK_REALTIME, &res);
resolution = res.tv_sec + (((double)res.tv_nsec)/1.0e9);
printf("SystemHZ %ld\n", (unsigned long)(1/resolution + 0.5));
return 0;
}
To obtain the USER_HZ value (see the comments under the accepted answer) using CLI:
getconf CLK_TCK

Resources