It seems that I have a problem with Linux IO performance. For a project I need to zero out a whole file from kernel space. I use the following code pattern:
for_each_mapping_page(mapping, index) {
    page = read_mapping_page(mapping, index);
    lock_page(page);
    /* kmap, memset, kunmap */
    set_page_dirty(page);
    write_one_page(page, 1);
    page_cache_release(page);
    cond_resched();
}
Everything works fine, but with large files (~3 GB and up in my case) the system stalls in a strange manner: until this operation completes I can't launch anything new. In other words, all the processes that existed before the operation keep running fine, but if I try to start something while it is in progress, I see nothing until it has completed.
Is this a kernel IO scheduling issue, or have I missed something? And how can I fix this problem?
Thanks.
UPD:
Following Kristof's suggestion I've reworked my code, and now it looks like this:
headIndex = soff >> PAGE_CACHE_SHIFT;
tailIndex = eoff >> PAGE_CACHE_SHIFT;

/*
 * process exactly the #headIndex .. #tailIndex range
 */
for (index = headIndex; index < tailIndex; index += nr_pages) {
    nr_pages = min_t(int, ARRAY_SIZE(pages), tailIndex - index);

    for (i = 0; i < nr_pages; i++) {
        pages[i] = read_mapping_page(mapping, index + i, NULL);
        if (IS_ERR(pages[i])) {
            while (i--)
                page_cache_release(pages[i]);
            goto return_result;
        }
    }

    for (i = 0; i < nr_pages; i++)
        zero_page_atomic(pages[i]);

    /* cast to loff_t so the byte offsets don't overflow for huge files */
    result = filemap_write_and_wait_range(mapping,
            (loff_t)index << PAGE_CACHE_SHIFT,
            ((loff_t)(index + nr_pages) << PAGE_CACHE_SHIFT) - 1);

    for (i = 0; i < nr_pages; i++)
        page_cache_release(pages[i]);

    if (result)
        goto return_result;
    if (fatal_signal_pending(current))
        goto return_result;

    cond_resched();
}
As a result I got better IO performance, but I still see heavy IO stalls when the same user that started the operation does other disk access concurrently with it.
Anyway, thanks for the suggestions.
In essence you're bypassing the kernel's IO scheduler completely.
If you look at the ext2 implementation you'll see it never (well, OK, once) calls write_one_page(). For large-scale data transfers it uses mpage_writepages() instead.
This uses the block I/O interface rather than accessing the hardware immediately, which means it passes through the IO scheduler. Large operations then don't block the entire system, because the scheduler automatically interleaves other requests with the large writes.
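For reference, this is roughly how ext2 hooks into it; a minimal sketch of its writepages callback (the exact signatures differ between kernel versions):

static int ext2_writepages(struct address_space *mapping,
                           struct writeback_control *wbc)
{
        /* Hand whole ranges of dirty pages to the block layer in one go;
         * the IO scheduler can then merge and interleave the requests. */
        return mpage_writepages(mapping, wbc, ext2_get_block);
}

Because writeback then goes through the request queue, other processes' reads can be serviced between the chunks of your large write.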
Currently I'm working with Metal compute shaders and trying to understand how GPU thread synchronization works there.
I wrote some simple code, but it doesn't work the way I expect:
Suppose I have a threadgroup variable, an array into which all threads can write their output simultaneously.
kernel void compute_features(device float &output [[ buffer(0) ]],
                             ushort2 group_pos [[ threadgroup_position_in_grid ]],
                             ushort2 thread_pos [[ thread_position_in_threadgroup ]],
                             ushort tid [[ thread_index_in_threadgroup ]])
{
    threadgroup short blockIndices[288];
    float someValue = 0.0;
    // doing some work here which fills someValue...
    blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x] = someValue;
    // wait until all threads are done with their calculations
    threadgroup_barrier(mem_flags::mem_none);
    output += blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x]; // accumulate every thread's result into output
}
The code above doesn't work. The output variable doesn't contain the sum of all threads' calculations; it only contains the value from whichever thread was presumably last to add to output. To me it seems like threadgroup_barrier does absolutely nothing.
Now, the interesting part. The code below works:
blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x] = someValue;
threadgroup_barrier(mem_flags::mem_none); // wait until all threads are done with calculations
if (tid == 0) {
    for (int i = 0; i < 288; i++) {
        output += blockIndices[i]; // accumulate all threads' results into output
    }
}
And this code works just as well as the previous one:
blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x] = someValue;
if (tid == 0) {
    for (int i = 0; i < 288; i++) {
        output += blockIndices[i]; // accumulate all threads' results into output
    }
}
To summarize: my code works as expected only when I handle the threadgroup memory from one GPU thread; it doesn't matter which one, it can be the last thread in the threadgroup as well as the first. And the presence of threadgroup_barrier makes absolutely no difference. I also tried threadgroup_barrier with the mem_threadgroup flag; the code still doesn't work.
I understand that I might be missing some very important detail, and I would be happy if someone could point out my errors. Thanks in advance!
When you write output += blockIndices[...], all threads try to perform this operation at the same time. But since output is not an atomic variable, this results in race conditions; it's not a thread-safe operation. A barrier only makes threads wait for one another, it doesn't make the += itself atomic.
Your second solution is the correct one: you need a single thread to collect the results (although you could split this up across multiple threads too, as sketched below). That it still works OK when you remove the barrier may just be luck.
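Splitting the summation across threads usually means a tree reduction in threadgroup memory. A minimal sketch, not your exact kernel: it assumes a 288-thread threadgroup (matching your array size) padded up to 512 slots, a float accumulator, and a single threadgroup (with several groups you'd need one output slot per group or a device atomic):

#include <metal_stdlib>
using namespace metal;

#define THREADS 288   // threads per threadgroup, as in the question
#define PADDED  512   // next power of two, keeps the halving loop simple

kernel void compute_features(device float &output [[ buffer(0) ]],
                             ushort tid [[ thread_index_in_threadgroup ]])
{
    threadgroup float partial[PADDED];

    float someValue = 0.0;             // ...per-thread work goes here...
    partial[tid] = someValue;
    if (tid + THREADS < PADDED)        // zero the padding slots
        partial[tid + THREADS] = 0.0;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // halve the number of active threads each step; every step is fenced
    // so each thread sees the sums written in the previous step
    for (ushort stride = PADDED / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    if (tid == 0)
        output = partial[0];           // a single writer, so no race
}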
I am using a MultiThreading class which creates the required number of threads in its own threadpool and deletes itself after use.
std::thread *m_pool; // number of threads according to available cores
std::mutex m_locker;
std::condition_variable m_condition;
std::atomic<bool> m_exit;
int m_processors;

m_pool = new std::thread[m_processors + 1];
void func()
{
    // code
}

for (int i = 0; i < m_processors; i++)
{
    m_pool[i] = std::thread(func);
}
void reset(void)
{
    {
        std::lock_guard<std::mutex> lock(m_locker);
        m_exit = true;
    }
    m_condition.notify_all();
    for (int i = 0; i <= m_processors; i++)
        m_pool[i].join();
    delete[] m_pool;
}
After running through all the tasks, the for loop is supposed to join all running threads before delete[] is executed.
But there seems to be one last thread still running while m_pool no longer exists.
This leads to the problem that I can't close my program anymore.
Is there any way to check if all threads are joined or wait for all threads to be joined before deleting the threadpool?
Simple typo bug I think.
Your loop with the condition i <= m_processors is an off-by-one bug. You allocate m_processors + 1 array entries but only start m_processors threads, so the last entry, m_pool[m_processors], is a default-constructed std::thread that never represents a running thread. Suppose m_processors is 2: you start threads in [0] and [1], yet the loop also touches [2]. Calling join() on that non-joinable thread throws std::system_error, and that is what keeps your program from shutting down cleanly.
You likely intended i < m_processors.
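A minimal sketch of the corrected shutdown, joining exactly the m_processors threads that were started:

for (int i = 0; i < m_processors; i++)
    m_pool[i].join();
delete[] m_pool;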
The real source of the problem is addressed by Wick's answer. I will extend it with some tips that also solve your problem while improving other aspects of your code.
If you use C++11's std::thread, you shouldn't create your thread handles with operator new[]. There are better C++ constructs for this, which make everything simpler and exception-safe (you don't leak memory if an unexpected exception is thrown).
Store your thread objects in a std::vector. It will manage the memory allocation and deallocation for you (no more new and delete). You can use other more flexible containers such as std::list if you insert/delete threads dynamically.
Fill the vector in place with std::generate_n or similar:
std::vector<std::thread> m_pool;
m_pool.reserve(m_processors);

// Fill the vector
std::generate_n(std::back_inserter(m_pool), m_processors,
                [](){ return std::thread(func); });
Join all the elements with a range-based for loop, then let the container clean up the handles.
for (std::thread& t : m_pool) {
    t.join();
}
m_pool.clear();
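Putting it together, the question's reset() might look like this with the vector-backed pool (a sketch reusing the m_locker, m_exit and m_condition members from the question):

void reset()
{
    {
        std::lock_guard<std::mutex> lock(m_locker);
        m_exit = true;            // tell the workers to finish up
    }
    m_condition.notify_all();     // wake any worker waiting on m_condition
    for (std::thread& t : m_pool) // join exactly the threads we started
        t.join();
    m_pool.clear();               // destroy the now-joined handles
}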
I am wondering how concurrency can be expressed without an explicit thread object. I don't mean the implementation, which would probably use threads or thread pools, but the language-design issues.
Q1: What would be lost if there were no thread object? What couldn't be done in such a language?
Q2: How could concurrency be expressed instead? What alternatives or complements to threads have been proposed or implemented?
One possibility is the MPI programming model (GPUs work in a similar way).
Let's say you have the following code:

for (int i = 0; i < 100; i++) {
    work(i);
}
The "normal" thread-based way would be to split the iteration range into multiple subsets, something like this:
Thread-1:

for (int i = 0; i < 50; i++) {
    work(i);
}

Thread-2:

for (int i = 50; i < 100; i++) {
    work(i);
}
In MPI/GPU programming, however, you do something different.
The idea is that every core executes the same (GPU) or at least a similar (MPI) program. The difference is that each core uses a different ID, which changes the behavior of the code.
MPI style (not exactly the real MPI syntax):
int rank = get_core_id();
int size = get_num_core();
int subset = 100 / size; /* assumes size divides 100 evenly */
for (int i = rank * subset; i < (rank + 1) * subset; i++) {
    /* each core will use a different range for i */
    work(i);
}
The next big thing is communication. Normally you have to do all of the synchronization manually. MPI is message-based, meaning it is not perfectly suited to classical shared-memory models (where every core has access to the same memory), but in a cluster system (many cores connected by a network) it works excellently. This is not limited to supercomputers (which basically run only MPI-style code): in recent years a new type of core architecture (manycore) has been developed, with a local so-called network-on-chip, so each core can send and receive messages without the usual synchronization problems.
MPI offers not only simple messages but also higher-level constructs that automatically scatter and gather data to and from every core.
Example (again, not the real MPI syntax):
int rank = get_core_id();
int size = get_num_core();
int data[100];
int result;
int results[size];

if (rank == 0) { /* master core only */
    fill_with_stuff(data);
}

scatter(0, data);           /* core 0 sends the data content to all other cores */
result = work(rank, data);  /* every core works on the same data */
gather(0, result, results); /* collect all local results into the
                               results array on core 0 */
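For comparison, here is a small self-contained version using the real MPI API. It's only a sketch: it broadcasts the same data to every rank (matching the pseudocode's "every core works on the same data") and gathers one int result per rank, with work() standing in as a hypothetical function:

#include <mpi.h>
#include <stdio.h>

/* hypothetical per-rank work function, standing in for work() above */
static int work(int rank, const int *data) { return data[rank % 100] + rank; }

int main(int argc, char **argv)
{
    int rank, size, data[100], result;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                 /* master core fills the data */
        for (int i = 0; i < 100; i++)
            data[i] = i;
    }

    /* every rank receives the same data */
    MPI_Bcast(data, 100, MPI_INT, 0, MPI_COMM_WORLD);

    result = work(rank, data);

    /* collect one int from every rank into results[] on rank 0 */
    int results[size];
    MPI_Gather(&result, 1, MPI_INT, results, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < size; i++)
            printf("rank %d -> %d\n", i, results[i]);

    MPI_Finalize();
    return 0;
}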
Another solution is the OpenMP library. Here you declare parallel blocks; the whole threading part is done by the library itself.
Example:
// this will automatically split the for loop across 4 threads
#pragma omp parallel for num_threads(4)
for (int i = 0; i < 100; i++) {
    work(i);
}
The big advantage is that it's fast to write; that's it. You may get better performance by writing the threads on your own, but that takes a lot more time and knowledge about synchronization.
I have a question about a concurrent programming problem.
More specifically, we are working with the shared-memory model (i.e. threads). The problem: given a pool of N equivalent resources, with the constraint that at any instant t only one thread may be using a given resource R, write a program that allocates these resources on demand to the threads that ask for them. To do this we have to use semaphores. Note that what these resources are and what the threads do with them is out of scope; the focus is on how the resources are managed and allocated. My professor gave us a C/Java-like pseudocode solution for the resources manager class:
class ResourcesManager {
    semaphore mutex = 1;
    semaphore availableResourcesSemaphore = N;
    boolean available[N];
    Resource resources[N];

    public ResourcesManager() {
        for (int i = 0; i < N; i++) {
            available[i] = true;
            resources[i] = new Resource();
        }
    }

    public int acquireResource() {
        int i = 0;
        P(availableResourcesSemaphore);
        P(mutex);
        while (available[i] == false) i++;
        available[i] = false;
        V(mutex);
        return i;
    }

    public void releaseResource(int i) {
        P(mutex);
        available[i] = true; // HERE IS THE PROBLEM
        V(mutex);
        V(availableResourcesSemaphore);
    }
}
While I see that this solution works, there is something I don't get about the releaseResource(int i) method. Why does the line marked with the comment "HERE IS THE PROBLEM",
available[i] = true;
need to be executed in mutual exclusion? I have thought about it, and to me it looks like nothing bad happens if we do otherwise.
What I mean is that while in the original solution we have:
P(mutex);
available[i] = true;
V(mutex);
we can replace these three lines simply with
available[i] = true;
and the solution is still correct.
Now, of course I see that mutual exclusion is needed when operating on the available array in the other method, acquireResource(), and since the instruction
available[i] = true;
operates on the same variable, it is more elegant and conceptually cleaner to operate on it in mutual exclusion too. On the other hand, as a beginner in concurrent programming, I don't think it's good to have mutual exclusion where it isn't needed.
So am I right (the instruction can be executed without mutual exclusion), or am I missing something and removing the mutual exclusion causes some issue? One final note on the execution environment: it can be either uniprocessor or multiprocessor, meaning the solution has to work in both cases. Thanks for your help!
In my application there is a small function that reads files to gather some information; the file count would be at least 50, so I thought of implementing threading. Say the user supplies 50 files: I wanted to split them as 5 × 10, creating 5 threads so that each thread can handle 10 files, which should speed up the process. As you can see from the code below, some variables are also shared. I have read some articles about threading and I am aware that only one thread should access a variable/control at a time (CCriticalSection can be used for that). As a beginner, I am finding it hard to implement what I have learned about threading. Could somebody please give me some ideas for the code shown below? Thanks in advance.
The file read function:
void CMyClass::GetWorkFilesInfo(CStringArray& dataFilesArray,
                                CString* dataFilesB,
                                int* check,
                                DWORD noOfFiles,
                                LPWSTR path)
{
    CString cFilePath;
    int cIndex = 0;
    int exceptionInd = 0;
    wchar_t** filesForWork = new wchar_t*[noOfFiles];
    int tempCheck;
    int localIndex = 0;

    for (DWORD index = 0; index < noOfFiles; index++)
    {
        tempCheck = *(check + index);
        if (tempCheck == NOCHECKBOX)
        {
            *(filesForWork + cIndex) = new TCHAR[MAX_PATH];
            wcscpy(*(filesForWork + cIndex), *(dataFilesB + index));
            cIndex++;
        }
        else // CHECKED or UNCHECKED
        {
            dataFilesArray.Add(*(dataFilesB + index));
            *(check + localIndex) = *(check + index);
            localIndex++;
        }
    }

    WorkFiles(&cFilePath, dataFilesArray, filesForWork, path, cIndex);
    dataFilesArray.Add(cFilePath);
    *(check + localIndex) = CHECKED;
}
I think you would be better off having just one thread read all the files. The context switching between the threads, together with the synchronization issues, is really not worth the price in your example. The hard drive is a single resource, so imagine all five threads taking turns moving its read heads to various positions on the disk: not very effective.
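If the goal is just to keep the UI responsive while the files are read, a sketch of that idea: move the whole sequential loop onto one worker thread. ProcessFile and ReadAllFiles are hypothetical names standing in for the per-file work and the call site in the question:

#include <thread>
#include <windows.h>   // for DWORD, matching the question's environment

// hypothetical per-file routine standing in for the body of the loop
// in GetWorkFilesInfo(); the real code would take the arrays it needs
void ProcessFile(DWORD index);

void ReadAllFiles(DWORD noOfFiles)
{
    // one worker reads every file sequentially, so the disk head moves
    // in order and no shared variable needs a CCriticalSection
    std::thread worker([noOfFiles] {
        for (DWORD index = 0; index < noOfFiles; index++)
            ProcessFile(index);
    });
    worker.join(); // in a GUI app, keep the handle and join at shutdown instead
}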