I'm trying to understand shared-memory architectures, especially ccNUMA systems. I have read about the first-touch policy, but I am still a bit confused about how data are distributed across memory pages. Take the example below. Under the first-touch policy, is it true that the processor performing the first write takes the page, and that this page will contain all array elements from A[0] to A[199] inclusive? Is that still true even if the number of bytes is less than the page size? Will this be a whole page (page number 0, for example)? Assume there are 5 threads.
#pragma omp parallel for
for(int i=0 ; i<1000 ; i++){
A[i] = i; // dummy values
}
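To make the page arithmetic concrete, here is a rough sketch of my own (assuming a 4 KiB page, a 4-byte int, and that A happens to start on a page boundary, none of which the question states) that prints which page each thread's 200-element chunk falls on; with these numbers the whole 1000-element array fits in a single page, so whichever thread touches A[0] first places the entire array on its node.
#include <stdio.h>
#include <unistd.h>
int main(void)
{
    long page = sysconf(_SC_PAGESIZE);              /* typically 4096 bytes */
    long elems_per_page = page / (long)sizeof(int); /* e.g. 1024 ints       */
    for (int t = 0; t < 5; t++) {
        long first = 200L * t, last = first + 199;
        printf("thread %d: A[%ld..%ld] lies on page %ld..%ld\n",
               t, first, last, first / elems_per_page, last / elems_per_page);
    }
    return 0;
}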
Let me give an example:
I want a vector dot product computed concurrently (it's not my actual case, this is only an example), so I have 2 large input vectors and a large output vector of the same size. The available work items are fewer than the sizes of these vectors. How can I do this dot product in OpenCL if there are fewer work items than elements in the vectors? Is this possible, or do I just have to use some tricks?
Something like:
for(i = 0; i < n; i++){
output[i] = input1[i]*input2[i];
}
with n > available work items
If by "available work items" you mean you're running into the maximum given by CL_DEVICE_MAX_WORK_ITEM_SIZES, you can always enqueue your kernel multiple times for different ranges of the array.
Depending on your actual workload, it may be more sensible to make each work item perform more work though. In the simplest case, you can use the SIMD types such as float4, float8, float16, etc. and operate on large chunks like that in one go. As always though, there is no replacement for trying different approaches and measuring the performance of each.
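A rough host-side sketch of the first suggestion, enqueueing the same kernel over successive ranges via the global work offset (that parameter needs OpenCL 1.1 or later); queue, kernel, n and the 65536 per-launch limit are placeholders assumed to be set up or queried elsewhere.
#include <CL/cl.h>
void run_in_ranges(cl_command_queue queue, cl_kernel kernel, size_t n)
{
    size_t max_items = 65536;                 /* assumed per-launch limit */
    for (size_t offset = 0; offset < n; offset += max_items) {
        size_t global = (n - offset < max_items) ? (n - offset) : max_items;
        clEnqueueNDRangeKernel(queue, kernel, 1,
                               &offset,  /* global work offset for this range */
                               &global,  /* global work size for this range   */
                               NULL,     /* let the runtime pick a local size */
                               0, NULL, NULL);
    }
    clFinish(queue);
}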
Divide and conquer the data. If you keep the workgroup size an integer divisor of the global work size, you can have N workgroups in total, perhaps k of them at once per kernel launch. So just launch N/k kernels, each with k*workgroup_size work items, and address the buffers accordingly inside the kernel.
Once you have per-workgroup partial sums of the dot product (computed with several in-group reduction steps), you can simply sum them on the CPU, or on whichever device the data is going to next.
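For the partial-sum approach, a minimal kernel sketch of my own (not from the answer): each work item accumulates a strided slice, the workgroup does an in-group tree reduction, and one partial sum per workgroup is left for the CPU to add up. It assumes float elements and a power-of-two workgroup size.
__kernel void partial_dot(__global const float *a,
                          __global const float *b,
                          __global float *partial_sums,
                          __local float *scratch,
                          const unsigned int n)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    /* each work item accumulates a strided slice, so n may exceed
       the total number of work items */
    float acc = 0.0f;
    for (size_t i = gid; i < n; i += get_global_size(0))
        acc += a[i] * b[i];
    scratch[lid] = acc;
    barrier(CLK_LOCAL_MEM_FENCE);
    /* in-group tree reduction (assumes power-of-two workgroup size) */
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    /* one partial sum per workgroup; sum these on the CPU afterwards */
    if (lid == 0)
        partial_sums[get_group_id(0)] = scratch[0];
}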
What's the best way to generate 1000K text files (with Perl on Windows 7)? I want to generate those text files in as little time as possible (ideally within 5 minutes). Right now I am using Perl threading with 50 threads, and it is still taking too long.
What would be the best solution? Do I need to increase the thread count? Is there any other way to write 1000K files in under five minutes? Here is my code:
use threads;

$start = 0;
$end = 10000;
my $start_run = time();
my @thr;
for($t=0; $t < 50; $t++) {
$thr[$t] = threads->create(\&files_write, $start, $end);
#start again from 10000 to 20000 loop
.........
}
for($t=0; $t < 50; $t++) {
$thr[$t]->join();
}
my $end_run = time();
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n";
I don't want the return results of those threads. I also tried detach(), but it didn't work for me.
Generating 500k files (each only 20 KB in size) took 1564 seconds (26 min). Can I achieve this within 5 minutes?
Edit: files_write only takes values from a predefined array structure and writes them to a file; that's it.
Any other solution?
The time needed depends on lots of factors, but heavy threading is probably not the solution:
creating files in the same directory at the same time probably requires locking in the OS, so it's better not to do too much of this in parallel
how the data gets laid out on disk depends on the amount of data and on how many writes you do in parallel. A bad layout can hurt performance a lot, especially on an HDD, and even an SSD cannot handle an arbitrary number of parallel writes. This all depends heavily on the disk you use, e.g. whether it is a desktop drive optimized for sequential writes or a server drive that can handle more parallel writes, as databases require.
... lots of other factors, often depending on the system
I would suggest using a thread pool with a fixed number of threads and benchmarking what the optimal thread count is for your specific hardware, e.g. start with a single thread and slowly increase the number. My guess is that the optimal number lies between 0.5 and 4 times the number of processor cores you have, but as I said, it depends heavily on your actual hardware.
The slow performance is probably due to Windows having to lock the filesystem down while creating files.
If it is only for testing - and not critical data - a RAMdisk may be ideal. Try Googling DataRam RAMdisk.
My program reads a file, interleaving it as below:
The file to be read is large. It is split into four parts, which are then split into many blocks. My program first reads block 1 of part 1, then jumps to block 1 of part 2, and so on, then comes back to block 2 of part 1, and so forth.
Performance drops in tests. I believe the reason is that the kernel's page cache doesn't work efficiently in this situation. But the file is too large to mmap(), and it is located on NFS.
How do I speed up reading in such a situation? Any comments and suggestions are welcome.
You may want to use posix_fadvise() to give the system hints about your usage, e.g. use POSIX_FADV_RANDOM to disable readahead, and possibly use POSIX_FADV_WILLNEED to have the system try to read the next block into the page cache before you need it (if you can predict this).
You could also use POSIX_FADV_DONTNEED once you are done reading a block to have the system free the underlying cache pages, although this might not be necessary.
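A sketch of how those hints might be wired in; advise_block and its parameters are names made up for illustration, assuming fd is the open descriptor and the offsets come from the part/block layout described in the question.
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <sys/types.h>
void advise_block(int fd, off_t block_off, off_t block_len, off_t next_off)
{
    /* the access pattern is not sequential, so turn off readahead */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    /* ask the kernel to start fetching the block we will need next */
    posix_fadvise(fd, next_off, block_len, POSIX_FADV_WILLNEED);
    /* ... read and process the block at block_off ... */
    /* optionally drop the pages of the block we just finished with */
    posix_fadvise(fd, block_off, block_len, POSIX_FADV_DONTNEED);
}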
For each pair of blocks, read both in, process the first, and push the second onto a stack. When you come to the end of the file, start shifting values off the bottom of the stack, processing them one by one.
You can break up the reading into linear chunks. For example, if your code looks like this:
int index = 0;
for (int block=0; block<n_blocks; ++block) {
for (int part=0; part<n_parts; ++part) {
seek(file,part*n_blocks+block);
data[part] = readChar(file);
}
send(data);
}
change it to this:
for (int chunk=0; chunk<n_chunks; ++chunk) {
for (int part=0; part<n_parts; ++part) {
seek(file,part*n_blocks+chunk*n_blocks_per_chunk);
for (int block=0; block<n_blocks_per_chunk; ++block) {
data[block*n_parts+part] = readChar(file);
}
}
send(data);
}
Then optimize n_blocks_per_chunk for your cache.
When you have a dynamically allocated buffer that varies its size at runtime in unpredictable ways (for example a vector or a string), one way to optimize its allocation is to resize its backing store only at powers of 2 (or some other set of boundaries/thresholds) and leave the extra space unused. This helps amortize the cost of searching for new free memory and of copying the data across, at the expense of a little extra memory use. For example, the interface specification (reserve vs. resize vs. trim) of many C++ STL containers has such a scheme in mind.
My question is does the default implementation of the malloc/realloc/free memory manager on Linux 3.0 x86_64, GLIBC 2.13, GCC 4.6 (Ubuntu 11.10) have such an optimization?
void* p = malloc(N);
... // time passes, stuff happens
void* q = realloc(p,M);
Put another way, for what values of N and M (or in what other circumstances) will p == q?
From the realloc implementation in glibc trunk at http://sources.redhat.com/git/gitweb.cgi?p=glibc.git;a=blob;f=malloc/malloc.c;h=12d2211b0d6603ac27840d6f629071d1c78586fe;hb=HEAD
First, if the memory has been obtained via mmap() instead of sbrk(), which glibc malloc does for large requests, >= 128 kB by default IIRC:
if (chunk_is_mmapped(oldp))
{
void* newmem;
#if HAVE_MREMAP
newp = mremap_chunk(oldp, nb);
if(newp) return chunk2mem(newp);
#endif
/* Note the extra SIZE_SZ overhead. */
if(oldsize - SIZE_SZ >= nb) return oldmem; /* do nothing */
/* Must alloc, copy, free. */
newmem = public_mALLOc(bytes);
if (newmem == 0) return 0; /* propagate failure */
MALLOC_COPY(newmem, oldmem, oldsize - 2*SIZE_SZ);
munmap_chunk(oldp);
return newmem;
}
(Linux has mremap(), so in practice this is what is done).
For smaller requests, a few lines below we have
newp = _int_realloc(ar_ptr, oldp, oldsize, nb);
where _int_realloc is a bit big to copy-paste here, but you'll find it starting at line 4221 in the link above. AFAICS, it does NOT do the constant-factor growth optimization that e.g. C++ std::vector does, but rather allocates exactly the amount requested by the user (rounded up to the next chunk boundary, plus alignment and so on).
I suppose the idea is that if the user wants this factor of 2 size increase (or any other constant factor increase in order to guarantee logarithmic efficiency when resizing multiple times), then the user can implement it himself on top of the facility provided by the C library.
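For example, a minimal sketch of doing that geometric growth yourself on top of realloc; buffer, buffer_append, len and cap are hypothetical names, not anything provided by glibc.
#include <stdlib.h>
#include <string.h>
typedef struct {
    char  *data;
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes reserved */
} buffer;
int buffer_append(buffer *b, const char *src, size_t n)
{
    if (b->len + n > b->cap) {
        size_t new_cap = b->cap ? b->cap : 16;
        while (new_cap < b->len + n)
            new_cap *= 2;                  /* grow by a constant factor */
        char *p = realloc(b->data, new_cap);
        if (!p)
            return -1;                     /* old buffer is still valid */
        b->data = p;
        b->cap = new_cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}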
Perhaps you can use malloc_usable_size (google for it) to find the answer experimentally. This function, however, seems undocumented, so you will need to check whether it is still available on your platform.
See also How to find how much space is allocated by a call to malloc()?
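A quick, glibc-specific experiment along those lines (malloc_usable_size lives in <malloc.h>; the request sizes here are arbitrary):
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    void *p = malloc(100);
    printf("requested 100, usable %zu\n", malloc_usable_size(p));
    void *q = realloc(p, 120);
    printf("after realloc to 120: usable %zu, moved: %s\n",
           malloc_usable_size(q), q == p ? "no" : "yes");
    free(q);
    return 0;
}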
Assume I have the following code:
int x[200];
void thread1() {
for(int i = 0; i < 100; i++)
x[i*2] = 1;
}
void thread2() {
for(int i = 0; i < 100; i++)
x[i*2 + 1] = 1;
}
Is this code correct under the x86-64 memory model (from what I understand, it is), assuming the page was configured with the default write-cache policy in Linux? What is the performance impact of such code (from what I understand, none)?
PS: As for performance, I am mostly interested in Sandy Bridge.
EDIT: As for expectations: I want to write to aligned locations from different threads. I expect that after the code above finishes and a barrier is reached, x contains {1,1,1,...} rather than {0,1,0,1,...} or {1,0,1,0,...}.
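To make the scenario concrete, here is a sketch using POSIX threads (the question does not say which threading API is meant); the joins act as the barrier, after which main checks that x contains only ones.
#include <pthread.h>
#include <stdio.h>
int x[200];
static void *even_writer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100; i++) x[i*2] = 1;
    return NULL;
}
static void *odd_writer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100; i++) x[i*2 + 1] = 1;
    return NULL;
}
int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, even_writer, NULL);
    pthread_create(&t2, NULL, odd_writer, NULL);
    pthread_join(t1, NULL);  /* the joins synchronize: all writes are */
    pthread_join(t2, NULL);  /* visible to main after this point      */
    int ok = 1;
    for (int i = 0; i < 200; i++) ok &= (x[i] == 1);
    printf("x is all ones: %s\n", ok ? "yes" : "no");
    return 0;
}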
If I understand correctly, the writes will eventually propagate via snoop requests. Sandy Bridge uses QuickPath between cores, so the snooping does not hit the FSB but uses a much faster interconnect. As it is not based on cache-invalidation-on-write, it should be 'fairly' quick, although I wasn't able to find what the overhead of conflict resolution is (but it is probably lower than an L3 write).
Source
EDIT: According to the Intel® 64 and IA-32 Architectures Optimization Reference Manual, a clean hit costs 43 cycles and a dirty hit 60 cycles (compared with the normal access overhead of 4 cycles for L1, 12 for L2, and 26-31 for L3).