Why is physical memory in Linux allocated incrementally rather than all at once? - linux

I wrote the program below, which allocates about 1.2 GB of memory at once, and I tested it on Linux. I found that:
If I define the macro *WRITE_MEM*, the physical memory usage (inspected with the command top) increases linearly.
If I don't define the macro, the physical memory usage stays very small (a few hundred kilobytes) and barely changes.
I don't understand this behaviour.
#include <iostream>
#include <cmath>
#include <cstdlib>

using namespace std;

float sum = 0.;

int main (int argc, char** argv)
{
    float* pf = (float*) malloc(1024*1024*300*4);
    float* p = pf;
    for (int i = 0; i < 300; i++) {
        cout << i << "..." << endl;
        float* qf = (float *) malloc(1024*1024*4);
        float* q = qf;
        for (int j = 0; j < 1024*1024; j++) {
            *q++ = sin(j*j*j*j);
        }
        q = qf;
        for (int j = 0; j < 1024*1024; j++) {
#ifdef WRITE_MEM // The physical memory usage will increase linearly
            *p++ = *q++;
            sum += *q;
#else            // The physical memory usage is small and will not change
            p++;
            // or
            // sum += *p++;
#endif
        }
        free(qf);
    }
    free(pf);
    return 0;
}

Linux allocates virtual address space immediately, but doesn't back it with physical memory until the pages are actually touched (demand paging): the first write to a page triggers a page fault, and only then does the kernel map a physical frame for it. As a result, processes only consume the physical memory they actually use, leaving the unused memory available for the rest of the system.
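To see this directly, here is a minimal sketch (Linux-specific: it reads the resident-set size, in pages, from /proc/self/statm; the 256 MB buffer size is just an example):

#include <cstdio>
#include <cstdlib>
#include <cstring>

// Read this process's resident set size, in pages, from /proc/self/statm.
static long resident_pages() {
    long size = 0, resident = 0;
    FILE* f = std::fopen("/proc/self/statm", "r");
    if (f) {
        std::fscanf(f, "%ld %ld", &size, &resident);
        std::fclose(f);
    }
    return resident;
}

int main() {
    const size_t bytes = 256UL * 1024 * 1024; // 256 MB
    char* buf = (char*) std::malloc(bytes);
    std::printf("after malloc: %ld resident pages\n", resident_pages());
    std::memset(buf, 1, bytes); // the first write to each page faults it in
    std::printf("after memset: %ld resident pages\n", resident_pages());
    std::free(buf);
    return 0;
}

The resident count right after malloc should be nearly unchanged, then jump by roughly bytes / page size after the memset, mirroring the linear growth seen in top when WRITE_MEM is defined.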

Related

Computing 2 independent matrix-vector multiplications with OpenMP tasks is slower than processing each one in turn

I am computing the matrix-vector product of two different matrices with the same vector, i.e. I need to compute A*v and B*v. I see speedup from using thread- and vector-level parallelism. I have noticed that, for the matrix sizes I have tested, the speedup curve flattens out after ~8 threads (for appropriately large matrices). As I am on a 16-thread machine, I thought it natural to throw another 4 or 5 threads at the other matrix-vector multiplication and compute both at the same time by assigning each to a task. However, as the problem size increases, computing with tasks in this thread range is consistently worse than computing without them. See the figure below. This is not intuitive to me; I would have expected the tasks to outperform the sequential function calls.
// This is consistently worse
#pragma omp task
matrixVectorMultiplicationParallel(A, V, resultsA, matrixSize, numThreads/2);
#pragma omp task
matrixVectorMultiplicationParallel(B, V, resultsB, matrixSize, numThreads/2);
#pragma omp taskwait
// This is consistently better (this corresponds to the blue line in the figures)
matrixVectorMultiplicationParallel(A, V, resultsA, matrixSize, numThreads);
matrixVectorMultiplicationParallel(B, V, resultsB, matrixSize, numThreads);
Have I misunderstood some aspect of tasking, or is this simply the overhead of creating tasks exceeding their usefulness here? Why, for instance, does computing A*v and B*v at the same time with 5 threads each (10 threads total) take as long as computing A*v and then B*v?
This is being used as part of a larger recursive algorithm, in which each recursive call computes a matrix-vector multiplication on smaller and smaller matrices.
Full code:
#include <time.h>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <assert.h>
#include <stdbool.h>
#include <omp.h>

#define THREAD_RANGE 16 // Run for 1:THREAD_RANGE threads
#define NUM_AVERAGES 10 // take the average of 10 timings for each matrix size and each number of threads
#define MATRIX_SIZE 3000

// gcc -fopenmp matrix_tasking.c -o matrix_tasking -O3 -Wall -Werror
double parallelTimings[THREAD_RANGE];

void matrixVectorMultiplicationSequential(double *restrict M, double *restrict V, double *restrict results, unsigned long matrixSize)
{
    int i, j;
    for (i = 0; i < matrixSize; i++)
    {
        double *MHead = &M[i * matrixSize];
        double tmp = 0;
        for (j = 0; j < matrixSize; j++)
        {
            tmp += MHead[j] * V[j];
        }
        results[i] = tmp;
    }
}

void matrixVectorMultiplicationParallel(double *restrict M, double *restrict V, double *restrict results, unsigned long matrixSize, int numThreads)
{
    omp_set_num_threads(numThreads);
    unsigned long i, j;
#pragma omp parallel for private(j)
    for (i = 0; i < matrixSize; i++)
    {
        double tmp = 0;
        double *MHead = &M[i * matrixSize];
#pragma omp simd reduction(+ : tmp)
        for (j = 0; j < matrixSize; j++)
        {
            tmp += MHead[j] * V[j];
        }
        results[i] = tmp;
    }
}

void doParallelComputation(double *restrict A, double *restrict B, double *restrict V, double *restrict resultsA, double *restrict resultsB, unsigned long matrixSize, int numThreads)
{
#pragma omp task
    matrixVectorMultiplicationParallel(A, V, resultsA, matrixSize, numThreads/2);
#pragma omp task
    matrixVectorMultiplicationParallel(B, V, resultsB, matrixSize, numThreads/2);
#pragma omp taskwait
}

void genRandVector(double *S, unsigned long size)
{
    srand(time(0));
    unsigned long i;
#pragma omp parallel for private(i)
    for (i = 0; i < size; i++)
    {
        double n = rand() % 3;
        S[i] = n;
    }
}

void doSequentialComputation(double *restrict A, double *restrict B, double *restrict V, double *restrict resultsA, double *restrict resultsB, unsigned long matrixSize)
{
    matrixVectorMultiplicationSequential(A, V, resultsA, matrixSize);
    matrixVectorMultiplicationSequential(B, V, resultsB, matrixSize);
}

void genRandMatrix(double *A, unsigned long size)
{
    srand(time(0));
    unsigned long i, j;
    for (i = 0; i < size; i++)
    {
        for (j = 0; j < size; j++)
        {
            double n = rand() % 3;
            A[i * size + j] = n;
        }
    }
}

int main(int argc, char *argv[])
{
    struct timespec start, finish;
    double elapsed;
    unsigned long matrixSize = 100;
    double *V = (double *)malloc(matrixSize * sizeof(double));
    double *seqVA = (double *)malloc(matrixSize * sizeof(double)); // Results of A*v in the sequential implementation
    double *parVA = (double *)malloc(matrixSize * sizeof(double)); // Results of A*v in the parallel implementation
    double *seqVB = (double *)malloc(matrixSize * sizeof(double)); // Results of B*v in the sequential implementation
    double *parVB = (double *)malloc(matrixSize * sizeof(double)); // Results of B*v in the parallel implementation
    double *A = (double *)malloc(matrixSize * matrixSize * sizeof(double)); // First matrix to multiply by V
    double *B = (double *)malloc(matrixSize * matrixSize * sizeof(double)); // Second matrix to multiply by V

    genRandVector(V, matrixSize);
    genRandMatrix(A, matrixSize);
    genRandMatrix(B, matrixSize);

    double sequentialTiming = 0;
    for (int a = 0; a < NUM_AVERAGES; a++)
    {
        clock_gettime(CLOCK_MONOTONIC, &start);
        doSequentialComputation(A, B, V, seqVA, seqVB, matrixSize);
        clock_gettime(CLOCK_MONOTONIC, &finish);
        elapsed = (finish.tv_sec - start.tv_sec);
        elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
        sequentialTiming += elapsed;

        for (int t = 1; t <= THREAD_RANGE; t++)
        {
            clock_gettime(CLOCK_MONOTONIC, &start);
            omp_set_num_threads(t);
#pragma omp parallel
            {
#pragma omp single
                doParallelComputation(A, B, V, parVA, parVB, matrixSize, t);
            }
            // doParallelComputation(A, B, V, parVA, parVB, matrixSize, t);
            clock_gettime(CLOCK_MONOTONIC, &finish);
            elapsed = (finish.tv_sec - start.tv_sec);
            elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
            parallelTimings[t - 1] += elapsed;

            for (int i = 0; i < matrixSize; i++)
            {
                assert(fabs(seqVA[i] - parVA[i]) < 0.01);
            }
        }
    }

    sequentialTiming /= NUM_AVERAGES;
    printf("Sequential: %f \n", sequentialTiming);
    printf("Parallel: ");
    for (int t = 0; t < THREAD_RANGE; t++)
    {
        parallelTimings[t] /= NUM_AVERAGES;
        printf("%f ", parallelTimings[t]);
    }
    printf("\n");

    free(seqVA);
    free(parVA);
    free(seqVB);
    free(parVB);
    free(A);
    free(V);
    free(B);
    return 0;
}
I suspect the answers lie in the hardware, not so much at the thread or task level.
First, you say you have 16 threads, but are they hyperthreads? By that I mean: do you in fact have 8 cores or 16 cores in your CPU? Hyperthreads are meant to increase the utilisation of shared resources within a core, such as the FPUs; if those are already saturated, adding more threads does not increase the number of operations your machine can perform.
That would explain why your performance flattens out at 8 threads, and thus also why adding more parallelism (i.e. running the two multiplications simultaneously) doesn't accelerate things further.
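As a quick check, a minimal sketch: omp_get_num_procs() reports logical processors, which includes hyperthreads, so on Linux you would compare its output against the physical-core count shown by lscpu.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Logical processors visible to OpenMP; with hyperthreading enabled
       this is typically twice the number of physical cores. */
    printf("logical processors: %d\n", omp_get_num_procs());
    return 0;
}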
Second, I think the rest of the difference, especially at large sizes, comes down to caches and/or memory bandwidth:
for each row of the result (i.e. each inner loop over j), you need the full input vector and a full row of the matrix. If all threads are executing the same multiplication, then they share the same input vector.
This means more cache pressure and more memory bandwidth needed to sustain the computation if you're computing both multiplications at the same time, and both are finite resources. To see which is the real bottleneck you'd need to run some profiling (see the example below).
So the effect you're seeing could be due to:
the same utilisation of computing resources, contrary to what you suspect, and
less efficient use of caches and memory bandwidth.
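One possible way to profile this (assuming Linux perf is installed; the binary name is taken from the compile comment in the question's code) is to compare hardware counters for the tasked and untasked variants:

$ perf stat -e cache-references,cache-misses ./matrix_tasking

A markedly higher miss count for the tasked variant would point at the caches; otherwise memory bandwidth is the more likely suspect.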
Don't do this at all!
Use the appropriate BLAS routine which will already be parallelised, vectorised, cache-blocked, and portable to the next machine you happen to run on.
See, for instance, the BLAS routine dgemv (documented alongside LAPACK), and then consider which of the many BLAS implementations (OpenBLAS, MKL, ...) you want to use.
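For illustration, a minimal sketch of what that replacement might look like, assuming a CBLAS implementation is installed and linked (e.g. gcc dgemv_demo.c -lopenblas); the names mirror the question's code:

#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    const int n = 3000;
    double *A = (double *) malloc((size_t)n * n * sizeof(double));
    double *V = (double *) malloc(n * sizeof(double));
    double *resultsA = (double *) malloc(n * sizeof(double));
    for (size_t i = 0; i < (size_t)n * n; i++) A[i] = rand() % 3;
    for (int i = 0; i < n; i++) V[i] = rand() % 3;

    /* resultsA = 1.0 * A * V + 0.0 * resultsA, with A stored row-major
       as an n x n array -- the same layout the question uses. */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                1.0, A, n, V, 1, 0.0, resultsA, 1);

    printf("resultsA[0] = %f\n", resultsA[0]);
    free(A); free(V); free(resultsA);
    return 0;
}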

Large overhead in CUDA kernel launch outside GPU execution

I am measuring the running time of kernels, as seen from a CPU thread, by timing the interval from just before a kernel launch to just after the following cudaDeviceSynchronize (using gettimeofday). I call cudaDeviceSynchronize once before I start recording the interval. I also instrument the kernels to record a timestamp on the GPU (using clock64()) at the start of the kernel: thread (0,0,0) of each block, from block (0,0,0) to block (occupancy-1,0,0), writes to an array whose size equals the number of SMs. At the end of the kernel code, every thread updates the timestamp in another array (of the same size) at the index of the SM it runs on.
The intervals calculated from the two arrays are 60-70% of that measured from the CPU thread.
For example, on a K40, while gettimeofday gives an interval of 140 ms, the average of the intervals calculated from GPU timestamps is only 100 ms. I have experimented with many grid sizes (15 blocks to 6K blocks) but have found similar behavior so far.
__global__ void some_kernel(long long *d_start, long long *d_end){
    if (threadIdx.x == 0){
        d_start[blockIdx.x] = clock64();
    }
    // some_kernel code
    d_end[blockIdx.x] = clock64();
}
Does this seem possible to the experts?
I suppose anything is possible for code you haven't shown. After all, you may just have a silly bug in any of your computation arithmetic. But if the question is "is it sensible that there should be 40ms of unaccounted-for time overhead on a kernel launch, for a kernel that takes ~140ms to execute?" I would say no.
I believe the method I outlined in the comments is reasonably accurate. Take the minimum clock64() timestamp from any thread in the grid (but see the note below regarding the per-SM restriction) and compare it to the maximum timestamp of any thread in the grid. The difference will be comparable to the execution time reported by gettimeofday(), to within 2 percent, according to my testing.
Here is my test case:
$ cat t1040.cu
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <sys/time.h>

#define LS_MAX 2000000000U
#define MAX_SM 64
#define USECPSEC 1000000ULL

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__device__ int result;
__device__ unsigned long long t_start[MAX_SM];
__device__ unsigned long long t_end[MAX_SM];

unsigned long long dtime_usec(unsigned long long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

__device__ __inline__ uint32_t __mysmid(){
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void kernel(unsigned ls){
    unsigned long long int ts = clock64();
    unsigned my_sm = __mysmid();
    atomicMin(t_start+my_sm, ts);
    // junk code to waste time
    int tv = ts&0x1F;
    for (unsigned i = 0; i < ls; i++){
        tv &= (ts+i);
    }
    result = tv;
    // end of junk code
    ts = clock64();
    atomicMax(t_end+my_sm, ts);
}

// optional command line parameter 1 = kernel duration, parameter 2 = number of blocks, parameter 3 = number of threads per block
int main(int argc, char *argv[]){
    unsigned ls;
    if (argc > 1) ls = atoi(argv[1]);
    else ls = 1000000;
    if (ls > LS_MAX) ls = LS_MAX;
    int num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
    cudaCheckErrors("cuda get attribute fail");
    int gpu_clk = 0;
    cudaDeviceGetAttribute(&gpu_clk, cudaDevAttrClockRate, 0);
    if ((num_sms < 1) || (num_sms > MAX_SM)) {printf("invalid sm count: %d\n", num_sms); return 1;}
    unsigned blks;
    if (argc > 2) blks = atoi(argv[2]);
    else blks = num_sms;
    if ((blks < 1) || (blks > 0x3FFFFFFF)) {printf("invalid blocks: %u\n", blks); return 1;}
    unsigned ntpb;
    if (argc > 3) ntpb = atoi(argv[3]);
    else ntpb = 256;
    if ((ntpb < 1) || (ntpb > 1024)) {printf("invalid threads: %u\n", ntpb); return 1;}
    kernel<<<1,1>>>(100); // warm up
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel fail");
    unsigned long long *h_start, *h_end;
    h_start = new unsigned long long[num_sms];
    h_end = new unsigned long long[num_sms];
    for (int i = 0; i < num_sms; i++){
        h_start[i] = 0xFFFFFFFFFFFFFFFFULL;
        h_end[i] = 0;
    }
    cudaMemcpyToSymbol(t_start, h_start, num_sms*sizeof(unsigned long long));
    cudaMemcpyToSymbol(t_end, h_end, num_sms*sizeof(unsigned long long));
    unsigned long long htime = dtime_usec(0);
    kernel<<<blks,ntpb>>>(ls);
    cudaDeviceSynchronize();
    htime = dtime_usec(htime);
    cudaMemcpyFromSymbol(h_start, t_start, num_sms*sizeof(unsigned long long));
    cudaMemcpyFromSymbol(h_end, t_end, num_sms*sizeof(unsigned long long));
    cudaCheckErrors("some error");
    printf("host elapsed time (ms): %f \n device sm clocks:\n start:", htime/1000.0f);
    unsigned long long max_diff = 0;
    for (int i = 0; i < num_sms; i++) {printf(" %12llu ", h_start[i]);}
    printf("\n end: ");
    for (int i = 0; i < num_sms; i++) {printf(" %12llu ", h_end[i]);}
    for (int i = 0; i < num_sms; i++) if ((h_start[i] != 0xFFFFFFFFFFFFFFFFULL) && (h_end[i] != 0) && ((h_end[i]-h_start[i]) > max_diff)) max_diff=(h_end[i]-h_start[i]);
    printf("\n max diff clks: %llu\nmax diff kernel time (ms): %f\n", max_diff, max_diff/(float)(gpu_clk));
    return 0;
}
$ nvcc -o t1040 t1040.cu -arch=sm_35
$ ./t1040 1000000 1000 128
host elapsed time (ms): 2128.818115
device sm clocks:
start: 3484744 3484724
end: 2219687393 2228431323
max diff clks: 2224946599
max diff kernel time (ms): 2128.117432
$
Notes:
This code can only be run on a cc3.5 or higher GPU due to the use of 64-bit atomicMin and atomicMax.
I've run it on a variety of grid configurations, on both a GT640 (a very low-end cc3.5 device) and a K40c (high end), and the host and device timings agree to within 2% for reasonably long kernel execution times. If you pass 1 as the first command line parameter, with very small grid sizes, the kernel execution time will be very short (nanoseconds), whereas the host will see about 10-20us; that is kernel launch overhead being measured. So the 2% figure applies to kernels that take much longer than 20us to execute.
It accepts 3 (optional) command line parameters, the first of which varies the amount of time the kernel will execute.
My timestamping is done on a per-SM basis, because clock64() is indicated to be a per-SM resource. The SM clocks are not guaranteed to be synchronized between SMs.
You can modify the grid dimensions. The second optional command line parameter specifies the number of blocks to launch. The third optional command line parameter specifies the number of threads per block. The timing methodology I have shown here should not be dependent on number of blocks launched or number of threads per block. If you specify fewer blocks than SMs, the code should ignore "unused" SM data.

pthreads code not scaling up

I wrote the following very simple pthread code to test how it scales up. I am running the code on a machine with 8 logical processors and at no time do I create more than 8 threads (to avoid context switching).
With an increasing number of threads, each thread has less work to do. Also, it is evident from the code that there are no shared data structures between the threads which might be a bottleneck. But still, my performance degrades as I increase the number of threads.
Can somebody tell me what I am doing wrong here?
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int NUM_THREADS = 3;
unsigned long int COUNTER = 10000000000000;
unsigned long int LOOP_INDEX;

void* addNum(void *data)
{
    unsigned long int sum = 0;
    for (unsigned long int i = 0; i < LOOP_INDEX; i++) {
        sum += 100;
    }
    return NULL;
}

int main(int argc, char** argv)
{
    NUM_THREADS = atoi(argv[1]);
    pthread_t *threads = (pthread_t*)malloc(sizeof(pthread_t) * NUM_THREADS);
    int rc;
    clock_t start, diff;
    LOOP_INDEX = COUNTER/NUM_THREADS;
    start = clock();
    for (int t = 0; t < NUM_THREADS; t++) {
        rc = pthread_create((threads + t), NULL, addNum, NULL);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d", rc);
            exit(-1);
        }
    }
    void *status;
    for (int t = 0; t < NUM_THREADS; t++) {
        rc = pthread_join(threads[t], &status);
    }
    diff = clock() - start;
    int sec = diff / CLOCKS_PER_SEC;
    printf("%d", sec);
}
Note: All the answers I found online said that the overhead of creating the threads is more than the work they are doing. To test that, I commented out everything in the addNum() function. But after doing that, no matter how many threads I create, the time taken by the code is 0 seconds. So I think there is no such overhead.
clock() counts CPU time used, across all threads. So all that's telling you is that you're using a little bit more total CPU time, which is exactly what you would expect.
It's the total wall clock elapsed time which should be going down if your parallelisation is effective. Measure that with clock_gettime() specifying the CLOCK_MONOTONIC clock instead of clock().
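For illustration, a minimal sketch of the wall-clock measurement (a drop-in for the clock() calls in the question's main):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, finish;
    clock_gettime(CLOCK_MONOTONIC, &start);

    /* ... create and join the threads here ... */

    clock_gettime(CLOCK_MONOTONIC, &finish);
    double elapsed = (finish.tv_sec - start.tv_sec)
                   + (finish.tv_nsec - start.tv_nsec) / 1e9;
    /* Wall time should go down as threads are added, if the
       parallelisation is effective. */
    printf("wall time: %f s\n", elapsed);
    return 0;
}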

How does garbage collector know a raw pointer and its referenced memory is no longer used

I'm new to garbage collection. I have read this page: A garbage collector for C and C++. It gives a simple example in the section "Using the Garbage Collector: A simple example":
#include "gc.h"
#include <assert.h>
#include <stdio.h>
int main()
{
int i;
GC_INIT(); /* Optional on Linux/X86; see below. */
for (i = 0; i < 10000000; ++i)
{
int **p = (int **) GC_MALLOC(sizeof(int *));
int *q = (int *) GC_MALLOC_ATOMIC(sizeof(int));
assert(*p == 0);
*p = (int *) GC_REALLOC(q, 2 * sizeof(int));
if (i % 100000 == 0)
printf("Heap size = %d\n", GC_get_heap_size());
}
return 0;
}
Here, *p is a pointer stored in garbage-collector-managed memory, and it points to memory that is also inside the managed heap.
I'm curious how the garbage collector knows that the two blocks of memory allocated in an earlier loop iteration are no longer reachable and should be reclaimed on a later iteration.

C++ 11 std::thread strange behavior

I am experimenting a bit with std::thread and C++11, and I am encountering strange behaviour.
Please have a look at the following code:
#include <cstdlib>
#include <thread>
#include <vector>
#include <iostream>

void thread_sum_up(const size_t n, size_t& count) {
    size_t i;
    for (i = 0; i < n; ++i);
    count = i;
}

class A {
public:
    A(const size_t x) : x_(x) {}

    size_t sum_up(const size_t num_threads) const {
        size_t i;
        std::vector<std::thread> threads;
        std::vector<size_t> data_vector;
        for (i = 0; i < num_threads; ++i) {
            data_vector.push_back(0);
            threads.push_back(std::thread(thread_sum_up, x_, std::ref(data_vector[i])));
        }
        std::cout << "Threads started ...\n";
        for (i = 0; i < num_threads; ++i)
            threads[i].join();
        size_t sum = 0;
        for (i = 0; i < num_threads; ++i)
            sum += data_vector[i];
        return sum;
    }

private:
    const size_t x_;
};

int main(int argc, char* argv[]) {
    const size_t x = atoi(argv[1]);
    const size_t num_threads = atoi(argv[2]);
    A a(x);
    std::cout << a.sum_up(num_threads) << std::endl;
    return 0;
}
The main idea here is that I want to specify a number of threads which do independent computations (in this case, simple increments).
After all threads have finished, their results should be merged to obtain an overall result.
Just to clarify: this is only for testing purposes, to help me understand how C++11 threads work.
However, when compiling this code using the command
g++ -o threads threads.cpp -pthread -O0 -std=c++0x
on an Ubuntu box, I get very strange behaviour when I execute the resulting binary.
For example:
$ ./threads 1000 4
Threads started ...
Segmentation fault (core dumped)
(should yield the output: 4000)
$ ./threads 100000 4
Threads started ...
200000
(should yield the output: 400000)
Does anybody have an idea what is going on here?
Thanks in advance!
Your code has many problems (see even thread_sum_up for about 2-3 bugs), but the main bug I found by glancing at your code is here:
data_vector.push_back(0);
threads.push_back(std::thread(thread_sum_up, x_, std::ref(data_vector[i])));
See, when you push_back into a vector (I'm talking about data_vector), it can move all the previous data around in memory. But you take the address of (a reference to) a cell for your thread, and then push back again, making the previous reference invalid.
This will cause you to crash.
For an easy fix, add data_vector.reserve(num_threads); just after creating the vector.
Edit, at your request: some bugs in thread_sum_up
void thread_sum_up(const size_t n, size_t& count) {
    size_t i;
    for (i = 0; i < n; ++i); // see that last ';' there? It means this loop has an empty body. It shouldn't be there
    count = i; // You're just setting count to i. Why do that in a loop? Did you mean +=?
}
The cause of your crash might be that the reference from std::ref(data_vector[i]) is invalidated by the next push_back into data_vector. Since you know the number of threads, call data_vector.reserve(num_threads) before you start spawning them, to keep the references from being invalidated.
As you grow the vector with calls to push_back, it is likely to have to reallocate its storage, which invalidates references to the contained values. The threads then write through dangling references, which is undefined behavior.
Your options are to pre-allocate the size you need (vector::reserve is one option) or to choose a different container.
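For illustration, a minimal sketch of the corrected setup (names simplified): the results vector is sized up front so the references handed to the threads are never invalidated, and the worker loop actually accumulates inside its body:

#include <cstddef>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

void thread_sum_up(const std::size_t n, std::size_t& count) {
    std::size_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += 1;  // the work happens inside the loop body (no stray ';')
    count = sum;
}

int main() {
    const std::size_t x = 1000, num_threads = 4;
    std::vector<std::size_t> data(num_threads, 0);  // sized up front: no reallocation
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (std::size_t i = 0; i < num_threads; ++i)
        threads.emplace_back(thread_sum_up, x, std::ref(data[i]));
    for (auto& t : threads)
        t.join();
    std::size_t sum = 0;
    for (auto v : data)
        sum += v;
    std::cout << sum << std::endl;  // prints 4000
    return 0;
}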