How to share data between nodes - contiki-process

I have created a network with one sink and five sender nodes. I want to share a single piece of data (an integer value) with all the nodes, so that every node can both read and update it. I tried to do this by sharing memory between the nodes through a pointer. However, since Contiki uses protothreads, the data in the referenced memory is not visible to the other nodes; it is only updated on a single node.
Here is what I tried.
code for the sender nodes
int *ptr;
void share(int *val){
ptr=val;
printf(" the number after call is %d ",*ptr);
}
void display(){
printf(" the number is %d ",*ptr);
}
code for the sink node
void sendNumber(){
int *val;
*val = 12;
share(val);
}
when share(val) is called by the sink, the print result is
the number after call is 12
if display() is called by another node, the print result is
the number is 1
both the codes for the sender & sink are in different files.
I want the number assigned to *val to be displayed by both prints. That is, once the sink node calls node A's share(val), the received value should be saved in memory where the remaining four nodes can access it. However, my code updates the memory of each of the five nodes separately.
I would really & truly appreciate your help.
Thank you

Related

Local memory for each CUDA thread

I have a simple program below. My question is: where is "temp" actually stored? Is it in global or local memory? I need an array temp for each idx so that every thread has its own individual array. In this case it works properly. But in my actual program, when I tried to fill temp[0] from test2, the program stopped. With 1024 threads, the kernel only ran for around 200 of them. So I am wondering whether temp is shared or not. If it is, maybe there is a collision. I also did not get any error message. Could someone please explain this?
__device__ void test2(int temp[], int idx) {
temp[0] = idx;
printf("%d ", temp[0]);
}
__global__ void test() {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int *temp = (int *) malloc(100 * sizeof (int));
test2(temp, idx);
}
int main() {
test<<<1, 1024>>>();
return 0;
}
My question is that where is "temp" actually stored?
The allocation for temp is stored in a place called the device heap. It is a form of global memory. However the temp variable itself (i.e. the pointer value) is in local memory - not shared or visible to other threads.
I need array temp for each idx so that every thread has individual array temp.
You will get that, subject to caveats below. Each thread will have its own individual array, referenced by its local variable temp. Each thread will have a separate allocation for storage on the device heap.
People commonly have problems with in-kernel new or malloc. One of the main reasons is that the device heap is initially limited to 8MB, across all of your device heap allocations. So if enough threads do a new or malloc of enough allocation requests, you will run out of space.
When you run out of space, the API way to signal that is to return a zero pointer value for the allocation (a NULL pointer). If you then attempt to use this NULL pointer, you will have trouble.
For debugging purposes (i.e. to prove this is happening), test the pointer for NULL (i.e. == 0) before using it. If it is NULL, don't use it (perhaps print an error message instead).
You can read more about this in the documentation or in many questions here on the SO cuda tag. If you read any of these sources, you will discover that you can increase the size of the device heap.
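To make the "run out of space" point concrete, here is a rough host-side sketch of the arithmetic in plain C++. The helper names are hypothetical, the 8 MB figure is the default quoted above, and allocator bookkeeping is ignored, so the real limit will be somewhat lower than this optimistic bound.

```cpp
#include <cassert>
#include <cstddef>

// Default device heap size quoted in the answer above (8 MB).
constexpr std::size_t kDefaultHeapBytes = 8u * 1024u * 1024u;

// Optimistic upper bound on how many threads can each hold one live
// allocation of bytes_per_thread at the same time. Allocator overhead is
// ignored, so the true number is lower.
std::size_t max_threads_that_fit(std::size_t heap_bytes,
                                 std::size_t bytes_per_thread) {
    assert(bytes_per_thread > 0);
    return heap_bytes / bytes_per_thread;
}

// True if `threads` simultaneous allocations fit under the optimistic bound.
bool fits_in_heap(std::size_t heap_bytes, std::size_t threads,
                  std::size_t bytes_per_thread) {
    return threads <= max_threads_that_fit(heap_bytes, bytes_per_thread);
}
```

For the kernel in the question, 1024 threads times 100 ints is roughly 400 KB, which fits comfortably and is consistent with that version working; a program that requests, say, 64 KB per thread would exceed 8 MB after about 128 threads. The runtime call for raising the limit is cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes), made before launching any kernel that uses in-kernel malloc.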

Was: How does BPF calculate number of CPU for PERCPU_ARRAY?

I have encountered an interesting issue where a PERCPU_ARRAY created on one system with 2 processors creates an array with 2 per-CPU elements and on another system with 2 processors, an array with 128 per-CPU elements. The latter was rather unexpected to me!
The way I discovered this behavior is that a program that allocated an array for the number of CPUs (using get_nprocs_conf(3)) and then read in the PERCPU_ARRAY into it (using bpf_map_lookup_elem()) ended up writing past the end of the array and crashing.
I would like to find out what is the proper way to determine in a program that reads BPF maps the number of elements in a PERCPU_ARRAY used on a system.
Failing that, I think the second best approach is to pick a buffer for reading in that is "large enough." Here, the problem is similar: what is that number and is there a way to learn it at runtime?
The question comes from reading the source of bpftool, which figures this out:
unsigned int get_possible_cpus(void)
{
int cpus = libbpf_num_possible_cpus();
if (cpus < 0) {
p_err("Can't get # of possible cpus: %s", strerror(-cpus));
exit(-1);
}
return cpus;
}
int libbpf_num_possible_cpus(void)
{
static const char *fcpu = "/sys/devices/system/cpu/possible";
static int cpus;
int err, n, i, tmp_cpus;
bool *mask;
/* ---8<--- snip */
}
So that's how they do it!
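For readers who cannot call libbpf_num_possible_cpus() directly, the idea can be sketched in plain C++ as follows. This is a simplified illustration, not libbpf's implementation: the function names are my own, and the only things taken from the source above are the sysfs path and the fact that it holds a CPU list. The last helper assumes the usual rule that each CPU's slot in a per-CPU lookup is the value size rounded up to a multiple of 8 bytes.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <fstream>
#include <string>

// Counts the CPUs named in a kernel CPU-list string such as "0-127" or "0,2-3".
// Returns -1 if the string is empty or malformed.
int count_cpus_in_list(const std::string& list) {
    auto is_digit = [&](std::size_t i) {
        return i < list.size() &&
               std::isdigit(static_cast<unsigned char>(list[i])) != 0;
    };
    auto read_number = [&](std::size_t& i) {
        int value = 0;
        while (is_digit(i)) value = value * 10 + (list[i++] - '0');
        return value;
    };

    int count = 0;
    std::size_t pos = 0;
    while (pos < list.size() && list[pos] != '\n') {
        if (!is_digit(pos)) return -1;
        const int first = read_number(pos);
        int last = first;
        if (pos < list.size() && list[pos] == '-') {  // a range "first-last"
            ++pos;
            if (!is_digit(pos)) return -1;
            last = read_number(pos);
        }
        if (last < first) return -1;
        count += last - first + 1;
        if (pos < list.size() && list[pos] == ',') {
            ++pos;
            if (!is_digit(pos)) return -1;  // dangling comma
        } else if (pos < list.size() && list[pos] != '\n') {
            return -1;  // unexpected character
        }
    }
    return count > 0 ? count : -1;
}

// Reads the list from sysfs (Linux only); returns -1 on failure.
int possible_cpus_from_sysfs() {
    std::ifstream in("/sys/devices/system/cpu/possible");
    std::string line;
    if (!in || !std::getline(in, line)) return -1;
    return count_cpus_in_list(line);
}

// Bytes a user-space buffer needs for one per-CPU lookup: one slot per
// possible CPU, each slot being the value size rounded up to 8 bytes.
std::size_t percpu_buffer_bytes(std::size_t value_size, int ncpus) {
    assert(ncpus > 0);
    return ((value_size + 7) / 8 * 8) * static_cast<std::size_t>(ncpus);
}
```

Sizing the read buffer from the "possible" count rather than from get_nprocs_conf(3) is what avoids the overrun described above: a list of "0-127" gives 128 slots even on a machine with only 2 configured processors.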

DPDK rte_hash multithreading

Hello everyone! I'm writing a DPDK-based application. I read packets from the NIC and enqueue them to a ring. Multiple worker threads then dequeue packets from the rx ring and parse their headers to get the destination IP address and the layer 4 destination port. This data is packed into a structure:
struct session_key {
rte_be32_t ip_dst;
rte_be16_t port_dst;
};
This structure is used as a key in an rte_hash table. As the data in this hash table I use a uint32_t counter, which is incremented when a packet matches the key. I create the rte_hash with the RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY flag to make it thread safe for multithreaded reads and writes.
Each worker thread gets dst_ip and dst_port from the packet and looks up the hash table for that key. If the key exists, its value is incremented; if the key does not exist, it is added to the table with data = 1.
uint32_t *found;
int ret = rte_hash_lookup_data(sessions_hash_table, (void *)&key, (void **)&found);
if (ret < 0) {
uint32_t *data = rte_zmalloc("session_key", sizeof(uint32_t), 0);
*data = 1;
rte_hash_add_key_data(sessions_hash_table, &key, data);
} else {
(*found)++;
}
So I have multiple readers and writers on the hash table. After all workers have finished, overall statistics are calculated in the main thread, and the number of packets matching each IP/port pair is printed on screen.
The problem is that with only one worker everything is fine: the number of received packets equals the number of packets recorded in the hash table.
But when I use several worker threads, the numbers do not match. I understand that one thread may read the table while another is writing to it, but I thought that config flags like RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY would take care of multithreading.
So I need some advice on how to make rte_hash work in a multithreaded application where several threads write and read the same hash table.
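One thing worth checking, offered as a possibility rather than a confirmed diagnosis: the concurrency flag protects the table's internal structure, but the uint32_t that each entry points to is ordinary shared memory, so (*found)++ from several workers is an unsynchronized read-modify-write and increments can be lost. The lookup-then-add sequence is also a check-then-act pattern, so two workers that miss on the same key at the same time may both add it. The sketch below uses plain C++ threads instead of DPDK and only illustrates the counter half of the problem; the function name is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Every worker bumps the same counter `per_thread` times using an atomic
// read-modify-write, so no increment can be lost regardless of interleaving.
std::uint32_t count_with_atomic(unsigned workers, std::uint32_t per_thread) {
    assert(workers > 0);
    std::atomic<std::uint32_t> counter{0};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&counter, per_thread] {
            for (std::uint32_t i = 0; i < per_thread; ++i) {
                // Relaxed ordering is enough for a pure statistics counter.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : pool) t.join();
    return counter.load();
}
```

With a plain uint32_t and counter++ in place of the atomic, the same loop is a data race and typically ends below the expected total. In the DPDK code, the equivalent change would be to make the per-key increment atomic, or to keep per-worker counters and sum them in the main thread.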

Serial data acquisition program reading from buffer

I have developed an application in Visual C++ 2008 to read data periodically (50ms) from a COM Port. In order to periodically read the data, I placed the read function in an OnTimer function, and because I didn't want the rest of the GUI to hang, I called this timer function from within a thread. I have placed the code below.
The application runs fine, but it is showing the following unexpected behaviour: after the data source (a hardware device or even a data emulator) stops sending data, my application continues to receive data for a period of time that is proportional to how long the read function has been running for (EDIT: This excess period is in the same ballpark as the period of time the data is sent for). So if I start and stop the data flow immediately, this would be reflected on my GUI, but if I start data flow and stop it ten seconds later, my GUI continues to show data for 10 seconds more (EDITED).
I have made the following observations after exhausting all my attempts at debugging:
As mentioned above, this excess period of operation is proportional to how long the hardware has been sending data.
A data packet arrives every 50ms, so to receive 10 seconds' worth of data, my GUI must be receiving around 200 more data packets.
The only buffer I have declared is abBuffer which is just a byte array of fixed size. I don't think this can increase in size, so this data is being stored somewhere.
If I change something in the data packet, this change, understandably, is shown on the GUI after a delay (because of the above points). But this would imply that the data received at the COM port is stored in some variable sized buffer from which my read function is reading data.
I have timed the read and processing periods. The latter is instantaneous while the former very rarely (3 times in 1000 reads (following no discernible pattern)) takes 16ms. This is well within the 50ms window the GUI has for each read.
The following is my thread and timer code:
UINT CMyCOMDlg::StartThread(LPVOID param)
{
THREADSTRUCT *ts = (THREADSTRUCT*)param;
ts->_this->SetTimer(1,50,0);
return 0;
}
//Timer function that is called at regular intervals
void CMyCOMDlg::OnTimer(UINT_PTR nIDEvent)
{
if(m_bCount==true)
{
DWORD NoBytesRead;
BYTE abBuffer[45];
if(ReadFile((m_hComm),&abBuffer,45,&NoBytesRead,0))
{
if(NoBytesRead==45)
{
if(abBuffer[0]==0x10&&abBuffer[1]==0x10||abBuffer[0]==0x80&&abBuffer[1]==0x80)
{
fnSetData(abBuffer);
}
else
{
CString value;
value.Append("Header match failed");
SetDlgItemText(IDC_RXRAW,value);
}
}
else
{
CString value;
value.Append(LPCTSTR(abBuffer),NoBytesRead);
value.Append("\r\nInvalid Packet Size");
SetDlgItemText(IDC_RXRAW,value);
}
}
else
{
DWORD dwError2 = GetLastError();
CString error2;
error2.Format(_T("%d"),dwError2);
SetDlgItemText(IDC_RXRAW,error2);
}
fnClear();
}
else
{
KillTimer(1);
}
CDialog::OnTimer(nIDEvent);
}
m_bCount is just a flag I use to kill the timer and the ReadFile function is a standard Windows API call. ts is a structure that contains a pointer to the main dialog class, i.e., this.
Can anyone think of a reason this could be happening? I have tried a lot of things, and also my code does so little I cannot figure out where this unexpected behaviour is happening.
EDIT:
I am adding the COM port settings and timeouts used below :
dcb.BaudRate = CBR_115200;
dcb.ByteSize = 8;
dcb.StopBits = ONESTOPBIT;
dcb.Parity = NOPARITY;
SetCommState(m_hComm, &dcb);
_param->_this=this;
COMMTIMEOUTS timeouts;
timeouts.ReadIntervalTimeout=1;
timeouts.ReadTotalTimeoutMultiplier = 0;
timeouts.ReadTotalTimeoutConstant = 10;
timeouts.WriteTotalTimeoutMultiplier = 1;
timeouts.WriteTotalTimeoutConstant = 1;
SetCommTimeouts(m_hComm, &timeouts);
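You are processing at most one message per OnTimer() call. Your SetTimer(1,50,0) call requests a 50 ms interval (the first argument is the timer ID), but WM_TIMER is a low-priority message and is not guaranteed to arrive at exactly that rate. Whenever the timer fires less often than the data source sends (one message every 50 milliseconds), unread messages pile up in the serial driver's input buffer and your application cannot process them in a timely manner.
You can add a while loop as follows: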
while(true)
{
if(::ReadFile(m_hComm, &abBuffer, sizeof(abBuffer), &NoBytesRead, 0))
{
if(NoBytesRead == sizeof(abBuffer))
{
...
}
else
{
...
break;
}
}
else
{
...
break;
}
}
But there is another problem in your code. If your software reads while the data source is still in the middle of sending a message, NoBytesRead could be less than 45. You may want to accumulate the data in a message buffer such as a CString or std::queue<unsigned char>.
If the message doesn't end with a NULL terminator, passing it to a CString object is not safe.
Also, if the first byte is 0x80, CString will treat it as a multi-byte string, which may cause errors. If the message is not a literal text string, consider using another container such as std::vector<unsigned char>.
By the way, you don't need to call SetTimer() in a separate thread; starting a timer takes no time. I also recommend calling KillTimer() somewhere outside the OnTimer() function so that the code is more intuitive.
If the data source keeps sending data continuously, you may need to call PurgeComm() when you open/close the COM port.
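The buffering advice above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a drop-in fix: the class name PacketAssembler and its methods are hypothetical, the 45-byte size and the 0x10 0x10 / 0x80 0x80 headers are taken from the question, and resynchronising by dropping bytes until a header appears is my own choice.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

constexpr std::size_t kPacketSize = 45;  // packet length used in the question

// Accumulates raw bytes from successive reads and hands out complete packets.
class PacketAssembler {
public:
    // Append whatever a single read returned; it may be a partial packet.
    void feed(const unsigned char* data, std::size_t len) {
        assert(data != nullptr || len == 0);
        pending_.insert(pending_.end(), data, data + len);
    }

    // Extract the next complete packet into `out`.
    // Returns false if a full packet is not available yet.
    bool next_packet(std::vector<unsigned char>& out) {
        // Resynchronise: drop leading bytes until a valid header is in front.
        while (pending_.size() >= 2 && !has_header()) pending_.pop_front();
        if (pending_.size() < kPacketSize) return false;
        const auto end = pending_.begin() + static_cast<std::ptrdiff_t>(kPacketSize);
        out.assign(pending_.begin(), end);
        pending_.erase(pending_.begin(), end);
        return true;
    }

private:
    // Only called when at least two bytes are pending.
    bool has_header() const {
        return (pending_[0] == 0x10 && pending_[1] == 0x10) ||
               (pending_[0] == 0x80 && pending_[1] == 0x80);
    }

    std::deque<unsigned char> pending_;
};
```

In OnTimer(), the loop would feed every chunk ReadFile() returns into the assembler and then call next_packet() until it returns false, passing each packet to fnSetData(). Because the bytes live in a container of unsigned char rather than a CString, a missing terminator or a leading 0x80 is no longer a concern.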

Why is it slow when I parse Google protocol buffer messages in multiple threads?

I am trying to parse many Google protocol buffer messages from a binary file generated by calling SerializeToString. I first load all the bytes into heap memory allocated with new. I also have two arrays that store, for each message, the starting address of its bytes in the heap memory and its byte count.
Then I parse the messages by calling ParseFromString. I want to speed up this procedure by using multiple threads.
To each thread I pass the start index and end index into the address array and the byte count array.
In the parent process, the main code is:
struct ParsePara
{
char* str_buffer;
size_t* buffer_offset;
size_t* binary_string_length_array;
size_t start_idx;
size_t end_idx;
Flight_Ticket_Info* ticket_info_buffer_array;
};
//Flight_Ticket_Info is class of message
//offset_size is the count of message
ticket_array = new Flight_Ticket_Info[offset_size];
const int max_thread_count = 6;
pthread_t pthread_id_vec[max_thread_count];
CTimer thread_cost;
thread_cost.start();
vector<ParsePara*> para_vec;
const size_t each_count = ceil(float(offset_size) / max_thread_count);
for (size_t k = 0;k < max_thread_count;k++)
{
size_t start_idx = each_count * k;
size_t end_idx = each_count * (k+1);
if (start_idx >= offset_size)
break;
if (end_idx >= offset_size)
end_idx = offset_size;
ParsePara* cand_para_ptr = new ParsePara();
if (!cand_para_ptr)
{
_ERROR_EXIT(0,"[Malloc memory fail.]");
}
cand_para_ptr->str_buffer = m_valdata;//heap memory for storing Bytes of message
cand_para_ptr->buffer_offset = offset_array;//begin address of each message
cand_para_ptr->start_idx = start_idx;
cand_para_ptr->end_idx = end_idx;
cand_para_ptr->ticket_info_buffer_array = ticket_array;//array to store message
cand_para_ptr->binary_string_length_array = binary_length_array;//Bytes count of each message
para_vec.push_back(cand_para_ptr);
}
for(size_t k = 0 ;k < para_vec.size();k++)
{
int ret = pthread_create(&pthread_id_vec[k],NULL,parserFlightTicketForMultiThread,para_vec[k]);
if (0 != ret)
{
_ERROR_EXIT(0,"[Error] [create thread fail]");
}
}
for (size_t k = 0;k < para_vec.size();k++)
{
pthread_join(pthread_id_vec[k],NULL);
}
In each thread the thread function is:
void* parserFlightTicketForMultiThread(void* void_para_ptr)
{
ParsePara* para_ptr = (ParsePara*) void_para_ptr;
parserFlightTicketForMany(para_ptr->str_buffer,para_ptr->ticket_info_buffer_array,para_ptr->buffer_offset,
para_ptr->start_idx,para_ptr->end_idx,para_ptr->binary_string_length_array);
return NULL;
}
void parserFlightTicketForMany(const char* str_buffer,Flight_Ticket_Info* ticket_info_buffer_array,
size_t* buffer_offset,const size_t start_idx,const size_t end_idx,size_t* binary_string_length_array)
{
printf("start_idx:%zu,end_idx:%zu\n",start_idx,end_idx);
for (size_t k = start_idx;k < end_idx;k++)
{
if (k % 100000 == 0)
cout << k << endl;
size_t cand_offset = buffer_offset[k];
size_t binary_length = binary_string_length_array[k];
ticket_info_buffer_array[k].ParseFromString(string(&str_buffer[cand_offset],binary_length-1));
}
printf("done %zu %zu\n",start_idx,end_idx);
}
But the multi-threaded run costs more than the single-threaded one.
The single-thread cost is: 40455623ms
My computer has 8 cores, and the six-thread cost is: 131586865ms
Can anyone help me? Thank you!
Some possible problems -- you'll have to experiment to determine which:
Protobuf parsing speed is often limited by memory bandwidth rather than CPU time, especially with a large input data set. In that case, more threads won't help, since all the cores are sharing bandwidth to main memory. Indeed, having multiple cores fighting over memory bandwidth could make the overall operation slower. Note that the biggest consumer of memory is not the input bytes but rather the parsed data objects -- that is, the output of parsing -- which are many times larger than the encoded data. To improve this problem, consider writing the parsing loop so that it fully processes each message immediately after parsing, before moving on to the next message. That way, instead of allocating k protobuf objects, you only need to allocate one protobuf object per thread, and repeatedly reuse the same object for parsing. This way the object will (probably) stay in the core's private L1 cache and avoid consuming memory bandwidth; only the input bytes will be read over the main bus.
How are you loading data into RAM? Did you read() into a large array or did you mmap()? In the latter case the data is read from disk lazily -- it won't happen until you actually attempt to parse it. Even in the read() case, it could be that the data has been swapped out, creating similar effects. Either way, your threads are now not just fighting for memory bandwidth, but disk bandwidth, which is of course much slower. Having six threads reading separate parts of a big file will definitely be slower overall than having one thread read the whole file, because the operating system optimizes for sequential access.
Protobuf allocates memory during parsing. Many memory allocators take a lock while allocating new memory. Since all your threads are allocating tons and tons of objects in a tight loop, they will contend for this lock. Make sure you are using a thread-friendly memory allocator, such as Google's tcmalloc. Note that repeatedly reusing the same protobuf object in a parse-consume loop rather than allocating lots of different objects will also help immensely here, because the protobuf object will automatically reuse memory for sub-objects.
There may be a bug in your code and it might not be doing what you expect at all when multithreaded. For example, a bug might be causing all the threads to process the same data, rather than different data, and it could be that the data they're choosing happens to be bigger. Make sure you are testing that the results of your code are exactly the same when you run single-threaded vs. multi-threaded.
In short, if you want multiple cores to make your code faster, you have to think about not just what each core is doing, but what data is going in and out of each core, and how much the cores have to talk to each other. Ideally you want each core to operate all on its own without talking to anyone or anything; then you get maximum parallelism. That's not usually possible, of course, but the closer you can get to that, the better.
BTW, a random optimization for you:
ParseFromString(string(&str_buffer[cand_offset],binary_length-1))
Replace that with:
ParseFromArray(&str_buffer[cand_offset],binary_length-1)
Creating a std::string makes a copy of the data, which wastes time (and memory bandwidth). (This doesn't explain why threading is slow, though.)
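The parse-then-consume loop suggested above might look roughly like this. FakeMessage is a hypothetical stand-in that only mimics the shape of a generated class's ParseFromArray(); it is not the protobuf API, and "consuming" a message here just means adding up payload sizes. The point of the sketch is the structure: one reusable object per thread and no intermediate std::string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a generated protobuf class (NOT the real API).
struct FakeMessage {
    std::string payload;  // reused across parses: assign() keeps its capacity

    bool ParseFromArray(const void* data, int size) {
        if (data == nullptr || size < 0) return false;
        payload.assign(static_cast<const char*>(data),
                       static_cast<std::size_t>(size));
        return true;
    }
};

// Parses messages [start_idx, end_idx) with ONE reusable object and consumes
// each result immediately instead of storing every parsed message.
std::size_t parse_and_consume(const char* buffer,
                              const std::vector<std::size_t>& offsets,
                              const std::vector<std::size_t>& lengths,
                              std::size_t start_idx, std::size_t end_idx) {
    assert(offsets.size() == lengths.size());
    assert(start_idx <= end_idx && end_idx <= offsets.size());
    FakeMessage msg;  // one object per worker thread
    std::size_t total = 0;
    for (std::size_t k = start_idx; k < end_idx; ++k) {
        // Parse straight from the input buffer; no temporary copy is made.
        if (!msg.ParseFromArray(buffer + offsets[k],
                                static_cast<int>(lengths[k]))) {
            continue;  // skip messages that fail to parse
        }
        total += msg.payload.size();  // "consume" before the next parse
    }
    return total;
}
```

Each worker thread would call this on its own index range and return a small summary, so the only large memory traffic is reading the input bytes.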

Resources