Limitations of params, saved and session DML declarations - dml-lang

How much data can be processed using params, saved and session declarations?
How do these declarations affect performance and memory allocation/consumption (stack usage, data copying, etc.)?
What methods can be used in case of static/dynamic arrays with 10k-100k elements?

Params
An untyped param is expanded like a macro any time it is referenced, so resource consumption depends on its use. If you have a param with a large amount of data, then it usually means that the value is a compile-time list ([...]) with many elements, and you use a #foreach loop to process it. A #foreach loop is always unrolled, which gives long compile times and large generated code.
If a param is typed in a template, then that template evaluates the param once and stores a copy in heap-allocated memory. The data is shared between all instances of the device. Cost should be negligible.
Session
Data is heap-stored, one copy per device instance.
Saved
Pretty much like session data, but adds a presumably negligible per-module cost for attribute registration.
There are two more variants of data:
Constant C tables
header %{ const int data[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; %}
extern const int data;
Creates one super-cheap module-local instance.
Independent startup memoized method
independent startup memoized method data() -> (const int *) {
local int *ret = new int[10];
for (local int i = 0; i < 10; i++) {
ret[i] = i;
}
return ret;
}
The data will be heap-allocated, initialized once, and shared across instances. Initialization is done by code, which saves size if it's easy to express the data programmatically, but can be cumbersome if it's just a table of irregular data.

Related

Debugging in Threading Building Blocks

I would like to program with tasks in Threading Building Blocks, but how does one do the debugging in practice?
In general, printing is a solid technique for debugging programs.
In my experience with MPI parallelization, the right way to do logging is that each rank prints its debugging information to its own file (say "debug_irank", with irank the rank in MPI_COMM_WORLD) so that the logical errors can be found.
How can something similar be achieved with TBB? It is not clear how to access the thread number in the thread pool, as this is obviously something internal to TBB.
Alternatively, one could add an additional index specifying the rank when a task is generated, but this makes the code rather complicated since the whole program has to take care of that.
First, get the program working with 1 thread. To do this, construct a task_scheduler_init as the first thing in main, like this:
#include "tbb/tbb.h"
int main() {
tbb::task_scheduler_init init(1);
...
}
Be sure to compile with the macro TBB_USE_DEBUG set to 1 so that TBB's checking will be enabled.
If the single-threaded version works, but the multi-threaded version does not, consider using Intel Inspector to spot race conditions. Be sure to compile with TBB_USE_THREADING_TOOLS so that Inspector gets enough information.
Otherwise, I usually start by adding assertions, because the machine can check assertions much faster than I can read log messages. If I am really puzzled about why an assertion is failing, I use printfs and task ids (not thread ids). The easiest way to create a task id is to allocate one by post-incrementing a tbb::atomic<size_t> and storing the result in the task.
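For example, the counter approach might look roughly like this (a sketch; TaskRecord and g_nextTaskId are made-up names for illustration, not TBB API):
#include <cstddef>
#include "tbb/atomic.h"

// A counter with static storage duration, so it starts at zero.
tbb::atomic<size_t> g_nextTaskId;

struct TaskRecord {
    size_t id;
    TaskRecord() : id(g_nextTaskId++) {}   // post-increment hands out a unique id
};

// Inside the task body, tag every message with the task id rather than a thread id:
//   std::printf("[task %zu] value=%d\n", rec.id, someValue);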
If I'm having a really bad day and the printfs are changing program behavior so that the error does not show up, I use "delayed printfs". Stuff the printf arguments in a circular buffer, and run printf on the records later after the failure is detected. Typically for the buffer, I use an array of structs containing the format string and a few word-size values, and make the array size a power of two. Then an atomic increment and mask suffices to allocate slots. E.g., something like this:
const size_t bufSize = 1024;
struct record {
const char* format;
void *arg0, *arg1;
};
tbb::atomic<size_t> head;
record buf[bufSize];
void recf(const char* fmt, void* a, void* b) {
record* r = &buf[head++ & (bufSize-1)];
r->format = fmt;
r->arg0 = a;
r->arg1 = b;
}
void recf(const char* fmt, int a, int b) {
record* r = &buf[head++ & (bufSize-1)];
r->format = fmt;
r->arg0 = (void*)a;
r->arg1 = (void*)b;
}
The two recf routines record the format and the values. The casting is somewhat abusive, but on most architectures you can print the record correctly in practice with printf(r->format, r->arg0, r->arg1) even if the 2nd overload of recf created the record.
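Once the failure is detected (and the other threads are quiescent), the buffer can be drained in one place, for example like this (a sketch; dumpRecords is a made-up helper, and it assumes the buffer may have wrapped):
// Print the most recent records in order; slots that were never written are skipped.
void dumpRecords() {
    size_t end = head;                                // snapshot of the next free slot
    size_t begin = end >= bufSize ? end - bufSize : 0;
    for (size_t i = begin; i < end; ++i) {
        const record& r = buf[i & (bufSize - 1)];
        if (r.format)
            printf(r.format, r.arg0, r.arg1);
    }
}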

Why is it slow when I parse Google protocol buffer messages in multiple threads?

I am trying to parse many Google protocol buffer messages from a binary file generated by calling SerializeToString. I first load all the bytes into heap memory allocated with new. I also have two arrays that store, for each message, the start address of its bytes in that heap memory and its byte count.
Then I begin to parse the messages by calling ParseFromString. I want to speed up the procedure by using multiple threads.
Each thread is given a start index and an end index into the address array and the byte-count array.
In the parent process, the main code is:
struct ParsePara
{
char* str_buffer;
size_t* buffer_offset;
size_t* binary_string_length_array;
size_t start_idx;
size_t end_idx;
Flight_Ticket_Info* ticket_info_buffer_array;
};
//Flight_Ticket_Info is class of message
//offset_size is the count of message
ticket_array = new Flight_Ticket_Info[offset_size];
const int max_thread_count = 6;
pthread_t pthread_id_vec[max_thread_count];
CTimer thread_cost;
thread_cost.start();
vector<ParsePara*> para_vec;
const size_t each_count = ceil(float(offset_size) / max_thread_count);
for (size_t k = 0;k < max_thread_count;k++)
{
size_t start_idx = each_count * k;
size_t end_idx = each_count * (k+1);
if (start_idx >= offset_size)
break;
if (end_idx >= offset_size)
end_idx = offset_size;
ParsePara* cand_para_ptr = new ParsePara();
if (!cand_para_ptr)
{
_ERROR_EXIT(0,"[Malloc memory fail.]");
}
cand_para_ptr->str_buffer = m_valdata;//heap memory for storing Bytes of message
cand_para_ptr->buffer_offset = offset_array;//begin address of each message
cand_para_ptr->start_idx = start_idx;
cand_para_ptr->end_idx = end_idx;
cand_para_ptr->ticket_info_buffer_array = ticket_array;//array to store message
cand_para_ptr->binary_string_length_array = binary_length_array;//Bytes count of each message
para_vec.push_back(cand_para_ptr);
}
for(size_t k = 0 ;k < para_vec.size();k++)
{
int ret = pthread_create(&pthread_id_vec[k],NULL,parserFlightTicketForMultiThread,para_vec[k]);
if (0 != ret)
{
_ERROR_EXIT(0,"[Error] [create thread fail]");
}
}
for (size_t k = 0;k < para_vec.size();k++)
{
pthread_join(pthread_id_vec[k],NULL);
}
In each thread the thread function is:
void* parserFlightTicketForMultiThread(void* void_para_ptr)
{
ParsePara* para_ptr = (ParsePara*) void_para_ptr;
parserFlightTicketForMany(para_ptr->str_buffer,para_ptr->ticket_info_buffer_array,para_ptr->buffer_offset,
para_ptr->start_idx,para_ptr->end_idx,para_ptr->binary_string_length_array);
return NULL;
}
void parserFlightTicketForMany(const char* str_buffer,Flight_Ticket_Info* ticket_info_buffer_array,
size_t* buffer_offset,const size_t start_idx,const size_t end_idx,size_t* binary_string_length_array)
{
printf("start_idx:%d,end_idx:%d\n",start_idx,end_idx);
for (size_t k = start_idx;k < end_idx;k++)
{
if (k % 100000 == 0)
cout << k << endl;
size_t cand_offset = buffer_offset[k];
size_t binary_length = binary_string_length_array[k];
ticket_info_buffer_array[k].ParseFromString(string(&str_buffer[cand_offset],binary_length-1));
}
printf("done %ld %ld\n",start_idx,end_idx);
}
But the multi-threaded version costs more than the single-threaded one.
One-thread cost: 40455623 ms.
My computer has 8 cores, and the six-thread cost is: 131586865 ms.
Can anyone help me? Thank you!
Some possible problems -- you'll have to experiment to determine which:
Protobuf parsing speed is often limited by memory bandwidth rather than CPU time, especially with a large input data set. In that case, more threads won't help, since all the cores are sharing bandwidth to main memory. Indeed, having multiple cores fighting over memory bandwidth could make the overall operation slower. Note that the biggest consumer of memory is not the input bytes but rather the parsed data objects -- that is, the output of parsing -- which are many times larger than the encoded data. To mitigate this, consider writing the parsing loop so that it fully processes each message immediately after parsing, before moving on to the next message. That way, instead of allocating k protobuf objects, you only need to allocate one protobuf object per thread, and repeatedly reuse the same object for parsing (see the sketch at the end of this answer). This way the object will (probably) stay in the core's private L1 cache and avoid consuming memory bandwidth; only the input bytes will be read over the main bus.
How are you loading data into RAM? Did you read() into a large array or did you mmap()? In the latter case the data is read from disk lazily -- it won't happen until you actually attempt to parse it. Even in the read() case, it could be that the data has been swapped out, creating similar effects. Either way, your threads are now not just fighting for memory bandwidth, but disk bandwidth, which is of course much slower. Having six threads reading separate parts of a big file will definitely be slower overall than having one thread read the whole file, because the operating system optimizes for sequential access.
Protobuf allocates memory during parsing. Many memory allocators take a lock while allocating new memory. Since all your threads are allocating tons and tons of objects in a tight loop, they will contend for this lock. Make sure you are using a thread-friendly memory allocator, such as Google's tcmalloc. Note that repeatedly reusing the same protobuf object in a parse-consume loop rather than allocating lots of different objects will also help immensely here, because the protobuf object will automatically reuse memory for sub-objects.
There may be a bug in your code and it might not be doing what you expect at all when multithreaded. For example, a bug might be causing all the threads to process the same data, rather than different data, and it could be that the data they're choosing happens to be bigger. Make sure you are testing that the results of your code are exactly the same when you run single-threaded vs. multi-threaded.
In short, if you want multiple cores to make your code faster, you have to think about not just what each core is doing, but what data is going in and out of each core, and how much the cores have to talk to each other. Ideally you want each core to operate all on its own without talking to anyone or anything; then you get maximum parallelism. That's not usually possible, of course, but the closer you can get to that, the better.
BTW, a random optimization for you:
ParseFromString(string(&str_buffer[cand_offset],binary_length-1))
Replace that with:
ParseFromArray(&str_buffer[cand_offset],binary_length-1)
Creating a std::string makes a copy of the data, which wastes time (and memory bandwidth). (This doesn't explain why threading is slow, though.)
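Putting the first point and the ParseFromArray tip together, the per-thread loop might be restructured roughly like this (a sketch based on the question's data layout; consumeTicket is a hypothetical callback standing in for whatever you do with each parsed message):
// One message object per thread, reused for every parse; each message is fully
// processed before the next one is parsed, so only one decoded object is live.
void parseRange(const char* str_buffer,
                const size_t* buffer_offset,
                const size_t* binary_string_length_array,
                size_t start_idx, size_t end_idx)
{
    Flight_Ticket_Info ticket;   // reused across iterations; sub-objects keep their memory
    for (size_t k = start_idx; k < end_idx; k++)
    {
        // The -1 mirrors the length convention used in the question's code.
        if (!ticket.ParseFromArray(&str_buffer[buffer_offset[k]],
                                   (int)(binary_string_length_array[k] - 1)))
            continue;            // or record the parse failure
        consumeTicket(ticket);   // consume immediately instead of storing N objects
    }
}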

Struct store large chunk of data

I have a data file with millions of rows, and I want to read it and store the data in a struct.
public struct Sample
{
public int A;
public DateTime B;
}
Sample[] sample = new Sample[];
This definition gives me the error "Wrong number of indices inside []; expected 1".
How do I store the data in a struct with low memory usage? Is an array best, or something else?
var reader = new StreamReader(File.OpenRead(@"C:\test.csv"));
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
var values = line.Split(';');
}
In order to allocate an array, you need to specify how many elements it holds. That's what the error is telling you.
If you don't know the size, you can use List<T>, which will grow as needed. Internally, List<T> is implemented using T[], which means that from a lookup standpoint it acts like an array. However, the work of reallocating larger arrays as needed is handled by List<T>.

Looking for a lock-free RT-safe single-reader single-writer structure

I'm looking for a lock-free design conforming to these requisites:
a single writer writes into a structure and a single reader reads from this structure (this structure exists already and is safe for simultaneous read/write)
but at some time, the structure needs to be changed by the writer, which then initialises, switches and writes into a new structure (of the same type but with new content)
and at the next time the reader reads, it switches to this new structure (if the writer has switched to a new structure several times in between, the reader discards the intermediate structures, ignoring their data).
The structures must be reused, i.e. no heap memory allocation/free is allowed during write/read/switch operation, for RT purposes.
I have currently implemented a ring buffer containing multiple instances of these structures, but this implementation suffers from the fact that when the writer has used all the structures in the ring buffer, there is no structure left to switch to... The rest of the ring buffer still contains data which does not have to be read by the reader but cannot be reused by the writer either. As a consequence, the ring buffer does not fit this purpose.
Any idea (name or pseudo-implementation) of a lock-free design? Thanks for having considered this problem.
Here's one. The keys are that there are three buffers and the reader reserves the buffer it is reading from. The writer writes to one of the other two buffers. The risk of collision is minimal. Plus, this expands. Just make your member arrays one element longer than the number of readers plus the number of writers.
class RingBuffer
{
public:
RingBuffer():lastFullWrite(0)
{
//Initialize the elements of dataBeingRead to false
for(unsigned int i=0; i<DATA_COUNT; i++)
{
dataBeingRead[i] = false;
}
}
Data read()
{
// You may want to check to make sure write has been called once here
// to prevent read from grabbing junk data. Else, initialize the elements
// of dataArray to something valid.
unsigned int indexToRead = lastFullWrite;
Data dataCopy;
dataBeingRead[indexToRead] = true;
dataCopy = dataArray[indexToRead];
dataBeingRead[indexToRead] = false;
return dataCopy;
}
void write( const Data& dataArg )
{
unsigned int writeIndex(0);
//Search for an unused piece of data.
// It's O(n), but plenty fast enough for small arrays.
while( writeIndex < DATA_COUNT && true == dataBeingRead[writeIndex] )
{
writeIndex++;
}
dataArray[writeIndex] = dataArg;
lastFullWrite = writeIndex;
}
private:
static const unsigned int DATA_COUNT = 3; // readers + writers + 1
unsigned int lastFullWrite;
Data dataArray[DATA_COUNT];
bool dataBeingRead[DATA_COUNT];
};
Note: The way it's written here, there are two copies to read your data. If you pass your data out of the read function through a reference argument, you can cut that down to one copy.
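For the record, the single-copy variant suggested above could look something like this (a sketch against the class as written; the overload and its name are mine):
// Sketch: copy straight into the caller's object, so only one copy is made.
void read( Data& dataOut )
{
    unsigned int indexToRead = lastFullWrite;   // most recently completed write
    dataBeingRead[indexToRead] = true;          // reserve the slot
    dataOut = dataArray[indexToRead];           // single copy, into caller-owned storage
    dataBeingRead[indexToRead] = false;         // release the slot
}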
You're on the right track.
Lock-free communication of fixed-size messages between threads/processes/processors
Fixed-size ring buffers can be used for lock-free communication between threads, processes or processors if there is one producer and one consumer. Some checks to perform:
the head variable is written only by the producer (as an atomic action after writing)
the tail variable is written only by the consumer (as an atomic action after reading)
Pitfall: introducing a size variable or a buffer full/empty flag; these are typically written by both producer and consumer and hence will give you an issue.
I generally use ring buffers for this purpose. The most important lesson I've learned is that a ring buffer of size N can never hold more than N-1 elements; this way the head and tail variables are written only by the producer and the consumer respectively.
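A minimal single-producer/single-consumer sketch along those lines (the class name SpscRing and the modulo indexing are illustrative choices; on a real multi-core target the head/tail accesses would also need the platform's atomic or memory-ordering primitives rather than plain volatile):
// Minimal SPSC ring buffer: head is written only by the producer, tail only by
// the consumer; one slot is left unused so that full and empty can be told apart.
template <typename T, unsigned N>          // N = capacity + 1
class SpscRing {
    T slots[N];
    volatile unsigned head;                // next slot to write (owned by producer)
    volatile unsigned tail;                // next slot to read (owned by consumer)
public:
    SpscRing() : head(0), tail(0) {}
    bool push(const T& v) {                // called by the producer only
        unsigned next = (head + 1) % N;
        if (next == tail) return false;    // full: at most N-1 elements are stored
        slots[head] = v;
        head = next;                       // publish only after the slot is written
        return true;
    }
    bool pop(T& v) {                       // called by the consumer only
        if (tail == head) return false;    // empty
        v = slots[tail];
        tail = (tail + 1) % N;             // release the slot only after reading it
        return true;
    }
};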
Extension for large/variable size blocks
To use buffers in a real-time environment, you can either use memory pools (often available in optimized form in real-time operating systems) or decouple allocation from usage. The latter fits the question, I believe.
If you need to exchange large blocks, I suggest using a pool of buffer blocks and communicating pointers to those buffers through a queue; that is, add a third queue that hands out free-buffer pointers. This way the allocations can be done in the application (background) and your real-time portion has access to a variable amount of memory.
Application:
while (blockQueue.full != true)
{
buf = allocate block of memory from heap or buffer pool
msg = { .... , buf };
blockQueue.Put(msg)
}
Producer:
pBuf = blockQueue.Get()
// fill the buffer that pBuf points to, then hand it to the consumer
pQueue.Put(pBuf)
Consumer:
if (pQueue.Empty == false)
{
msg=pQueue.Get()
// use info in msg, with buf pointer
// optionally indicate that buf is no longer used
}

MFC multithreading with delete[] , dbgheap.c

I've got a strange problem and really don't understand what's going on.
I made my application multi-threaded using the MFC multithreading classes.
Everything works well so far, but now:
Somewhere in the beginning of the code I create the threads:
m_bucketCreator = new BucketCreator(128,128,32);
CEvent* updateEvent = new CEvent(FALSE, FALSE);
CWinThread** threads = new CWinThread*[numThreads];
for(int i=0; i<8; i++){
threads[i]=AfxBeginThread(&MyClass::threadfunction, updateEvent);
m_activeRenderThreads++;
}
this creates 8 threads working on this function:
UINT MyClass::threadfunction( LPVOID params ) //executed in new Thread
{
Bucket* bucket=m_bucketCreator.getNextBucket();
...do something with bucket...
delete bucket;
}
m_bucketCreator is a static member. Now I get a threading error in the destructor of Bucket on the attempt to delete the buffer (however, the way I understand it, this buffer should belong to this thread, so I don't get why there is an error). On the attempt to delete[] buffer, the error happens in _CrtIsValidHeapPointer() in dbgheap.c.
Visual Studio outputs the message that it hit a breakpoint, and that this can be either due to heap corruption or because the user pressed F12 (I didn't ;) ).
class BucketCreator {
public:
BucketCreator();
~BucketCreator(void);
void init(int resX, int resY, int bucketSize);
Bucket* getNextBucket(){
Bucket* bucket=NULL;
//enter critical section
CSingleLock singleLock(&m_criticalSection);
singleLock.Lock();
int height = min(m_resolutionY-m_nextY,m_bucketSize);
int width = min(m_resolutionX-m_nextX,m_bucketSize);
bucket = new Bucket(width, height);
//leave critical section
singleLock.Unlock();
return bucket;
}
private:
int m_resolutionX;
int m_resolutionY;
int m_bucketSize;
int m_nextX;
int m_nextY;
//multithreading:
CCriticalSection m_criticalSection;
};
and class Bucket:
class Bucket : public CObject{
DECLARE_DYNAMIC(RenderBucket)
public:
Bucket(int a_resX, int a_resY){
resX = a_resX;
resY = a_resY;
buffer = new float[3 * resX * resY];
int buffersize = 3*resX * resY;
for (int i=0; i<buffersize; i++){
buffer[i] = 0;
}
}
~Bucket(void){
delete[] buffer;
buffer=NULL;
}
int getResX(){return resX;}
int getResY(){return resY;}
float* getBuffer(){return buffer;}
private:
int resX;
int resY;
float* buffer;
Bucket& operator = (const Bucket& other) { /*..*/}
Bucket(const Bucket& other) {/*..*/}
};
Can anyone tell me what could be the problem here?
edit: this is the other static function I'm calling from the threads. Is this safe to do?
static std::vector<Vector3> generate_poisson(double width, double height, double min_dist, int k, std::vector<std::vector<Vector3> > existingPoints)
{
CSingleLock singleLock(&m_criticalSection);
singleLock.Lock();
std::vector<Vector3> samplePoints = std::vector<Vector3>();
...fill the vector...
singleLock.Unlock();
return samplePoints;
}
All the previous replies are sound. For the copy constructor, make sure that it doesn't just copy the buffer pointer, otherwise that will cause the problem: it needs to allocate a new buffer, not copy the pointer value, which would cause an error in 'delete'. But I don't get the impression that the copy constructor gets called in your code.
I've looked at the code and I am not seeing any error in it as is. Note that the thread synchronization isn't even necessary in this getNextBucket code, since it's returning a local variable and those are per-thread.
Errors in ValidateHeapPointer occur because something has corrupted the heap, which happens when a pointer writes past a block of memory. Often it's a for() loop that goes too far, a buffer that wasn't allocated large enough, etc.
The error is reported during a call to 'delete' because that's when the heap is validated for bugs in debug mode. However, the error has occurred before that time, it just happens that the heap is checked only in 'new' and 'delete'. Also, it isn't necessarily related to the 'Bucket' class.
What you need to do to find this bug, short of using tools like BoundsChecker or HeapValidator, is comment out sections of your code until it goes away; then you'll find the offending code.
There is another method to narrow down the problem. In debug mode, include <crtdbg.h> in your code and sprinkle calls to _CrtCheckMemory() at various points of interest. That will generate the error when the heap is corrupted. Simply move the calls around in your code to narrow down at what point the corruption begins to occur.
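For example (a sketch; someInterestingPoint is a made-up placeholder, and _CrtCheckMemory only does real checking in debug builds):
#include <crtdbg.h>

void someInterestingPoint()
{
    _ASSERTE( _CrtCheckMemory() );   // breaks here if the heap is already corrupt
    // ... the code under suspicion ...
    _ASSERTE( _CrtCheckMemory() );   // breaks here if the code above corrupted it
}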
I don't know which version of Visual C++ you are using. If you're using an earlier one like VC++ 6.0, make sure that you are using the Multithreaded DLL version of the C Run-Time Library in the compiler options.
You're constructing a RenderBucket. Are you sure you're calling the 'Bucket' class's constructor from there? It should look like this:
class RenderBucket : public Bucket {
public:
    RenderBucket( int a_resX, int a_resY )
    : Bucket( a_resX, a_resY )
    {
    }
};
Initializers in the Bucket class that set the buffer to NULL are a good idea... Also, making the default constructor and copy constructor private will help to make doubly sure those aren't being used. Remember: the compiler will create these automatically if you don't:
Bucket(); <-- default constructor
Bucket( int a_resx = 0, int a_resy = 0 ) <-- Another way to make your default constructor
Bucket(const class Bucket &B) <-- copy constructor
You haven't made a private copy constructor, or any default constructor. If class Bucket is constructed via one of these implicitly-defined methods, buffer will either be uninitialized, or it will be a copied pointer made by a copy constructor.
The copy constructor for class Bucket is Bucket(const Bucket &B) -- if you do not explicitly declare a copy constructor, the compiler will generate a "naive" copy constructor for you.
In particular, if this object is assigned, returned, or otherwise copied, the copy constructor will copy the pointer to a new object. Eventually, both objects' destructors will attempt to delete[] the same pointer and the second attempt will be a double deletion, a type of heap corruption.
I recommend you make class Bucket's copy constructor private, which will cause attempted copy construction to generate a compile error. As an alternative, you could implement a copy constructor which allocates new space for the copied buffer.
Exactly the same applies to the assignment operator, operator=.
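If copies of Bucket do need to be legal, a deep-copying pair could look something like this (a sketch written against the Bucket class as posted):
// Deep copy: each Bucket owns its own buffer, so delete[] never runs twice on the same pointer.
Bucket(const Bucket& other)
    : resX(other.resX), resY(other.resY),
      buffer(new float[3 * other.resX * other.resY])
{
    for (int i = 0; i < 3 * resX * resY; i++)
        buffer[i] = other.buffer[i];
}

Bucket& operator=(const Bucket& other)
{
    if (this != &other) {
        float* newBuffer = new float[3 * other.resX * other.resY];
        for (int i = 0; i < 3 * other.resX * other.resY; i++)
            newBuffer[i] = other.buffer[i];
        delete[] buffer;          // release the old storage only after the new copy exists
        buffer = newBuffer;
        resX = other.resX;
        resY = other.resY;
    }
    return *this;
}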
The need for a proper copy constructor is one of the 55 tips in Scott Meyers' excellent book, Effective C++: 55 Specific Ways to Improve Your Programs and Designs:
This book should be required reading for all C++ programmers.
If you add:
class Bucket {
/* Existing code as-is ... */
private:
Bucket() { buffer = NULL; }              // No default construction
Bucket(const Bucket &B);                 // No copy construction (declared, never defined)
Bucket& operator= (const Bucket &B);     // No assignment (declared, never defined)
};
and re-compile, you are likely to find your problem.
There is also another possibility: If your code contains other uses of new and delete, then it is possible these other uses of allocated memory are corrupting the linked-list structure which defines the heap memory. It is common to detect this corruption during a call to delete, because delete must utilize these data structures.

Resources