I'm trying to parallelize an algorithm using PVM for a university assignment. I've got the algorithm sorted, but the parallelization only almost works - the process intermittently gets stuck for no apparent reason. I can see no pattern; a run with the same parameters might work 10 times and then just get stuck on the next attempt...
None of the PVM functions (in the master or any child process) return error codes, the children seem to complete successfully, and no errors reach the console. It really does look as though the master simply isn't receiving every communication from the children - but only on occasional runs.
Oddly, though, I don't think it's just skipping a message - I've yet to see a result missing from a child that then successfully sent a completion signal (that is to say, I've never had a run reach completion and return an unexpected result). It's as though the child just becomes disconnected, and all messages from a certain point onwards cease arriving.
Batching the results up and sending fewer, but larger, messages seems to improve reliability - at least it feels like it sticks less often - but I don't have hard numbers to back this up...
Is it normal, common or expected that PVM will lose messages sent via pvm_send and its friends? Note that the error occurs whether all processes run on a single host or across multiple hosts.
Am I doing something wrong? Is there something I can do to help prevent this?
Update
I've reproduced the error in a very simple test case, code below, which just spawns four children and sends a single number to each; each child multiplies the number it receives by ten and sends it back. It works almost all the time, but occasionally it freezes with only three numbers printed out - one child's result missing (and said child will have completed).
Master:
#include <iostream>
#include <pvm3.h>

namespace constant { const int taskCount = 4; }

int main()
{
    // Start the PVM daemon if it isn't already running.
    pvm_start_pvmd( 0 , NULL , 0 );

    // Spawn the children.
    int taskIDs[constant::taskCount];
    pvm_spawn( "/path/to/pvmtest/child" , NULL , 0 , NULL , constant::taskCount , taskIDs );

    // Send one number to each child.
    int numbers[constant::taskCount] = { 5 , 10 , 15 , 20 };
    for( int i = 0 ; i < constant::taskCount ; ++i )
    {
        pvm_initsend( 0 );
        pvm_pkint( &numbers[i] , 1 , 1 );
        pvm_send( taskIDs[i] , 0 );
    }

    // Collect one result from each child.
    int received;
    for( int i = 0 ; i < constant::taskCount ; ++i )
    {
        pvm_recv( -1 , -1 );
        pvm_upkint( &received , 1 , 1 );
        std::cout << received << std::endl;
    }

    pvm_halt();
}
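(One note on the test case: pvm_spawn returns the number of tasks it actually started, so a quick sanity check along these lines - not part of the original test, purely an illustration - would rule out failed spawns:)

int spawned = pvm_spawn( "/path/to/pvmtest/child" , NULL , 0 , NULL , constant::taskCount , taskIDs );
if( spawned < constant::taskCount )
    std::cerr << "only " << spawned << " of " << constant::taskCount << " children spawned" << std::endl;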
Child:
#include <pvm3.h>

int main()
{
    // Receive the number from the master.
    int number;
    pvm_recv( -1 , -1 );
    pvm_upkint( &number , 1 , 1 );

    number *= 10;

    // Send the result back to the parent task.
    pvm_initsend( 0 );
    pvm_pkint( &number , 1 , 1 );
    pvm_send( pvm_parent() , 0 );
}
Not really an answer, but two things changed at the same time and the problem seems to have subsided:
I added a pvm_exit() call to the end of the slave binary, which is apparently best practice.
The configuration of PVM over the cluster changed ... somehow ... I don't have any specifics, but a few nodes were previously unable to take part in PVM operations and now can. Other things may have changed as well.
I suspect whatever changed in the second point also happened to fix my problem.
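For reference, the first change amounts to the child ending like this (pvm_exit() tells the local pvmd the task is leaving PVM before the process exits):

    // ...
    pvm_send( pvm_parent() , 0 );
    pvm_exit();
}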
I am trying to list the state of all the tasks that are currently running using vTaskList(). Whenever I call the function I get a HardFault, and I have no idea where it faults. I tried increasing the heap size and stack size; this causes vTaskList() to work once, but the second time it throws a hard fault again.
Following is how I am using vTaskList() in osThreadList():
osStatus osThreadList (uint8_t *buffer)
{
#if ( ( configUSE_TRACE_FACILITY == 1 ) && ( configUSE_STATS_FORMATTING_FUNCTIONS == 1 ) )
    vTaskList((char *)buffer);
#endif
    return osOK;
}
Following is how I use osThreadList() to print all the tasks on my serial terminal:
uint8_t TskBuf[1024];

bool IOParser::TSK(bool print_help)
{
    if(print_help)
    {
        uart_printf("\nTSK: Display list of tasks.\r\n");
    }
    else
    {
        uart_printf("\r\nName State Priority Stack Num\r\n" );
        uart_printf("---------------------------------------------\r\n");
        /* The list of tasks and their status */
        osThreadList(TskBuf);
        uart_printf( (char *)TskBuf);
        uart_printf("---------------------------------------------\r\n");
        uart_printf("B : Blocked, R : Ready, D : Deleted, S : Suspended");
    }
    return true;
}
When I comment out any one of the tasks I am able to get it working. I am guessing it is something related to memory, but I haven't been able to find a solution.
vTaskList() depends on sprintf(), so your guess about memory and the heap is right. But you should allocate the buffer dynamically and pass that block instead of what you do now: use pvPortMalloc(), and after you finish, free it up using vPortFree().
Also, it is worth noting that vTaskList() is a blocking function.
I do not have a working code example to show this as of now, but this should work.
Hard faults are very often caused by uninitialised pointers, and the approach above eliminates that possibility.
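Something like the following is the idea - a minimal sketch only, reusing the question's uart_printf() and 1024-byte size (both carried over as assumptions, not tested code):

char *taskListBuf = (char *)pvPortMalloc( 1024 );
if( taskListBuf != NULL )
{
    vTaskList( taskListBuf );   /* blocks while it formats the task table */
    uart_printf( taskListBuf );
    vPortFree( taskListBuf );   /* return the block to the FreeRTOS heap */
}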
Currently I'm working with Metal compute shaders and trying to understand how GPU thread synchronization works there.
I wrote some simple code, but it doesn't work the way I expect it to.
Suppose I have a threadgroup variable - an array where all threads can produce output simultaneously:
kernel void compute_features(device float &output [[ buffer(0) ]],
                             ushort2 group_pos [[ threadgroup_position_in_grid ]],
                             ushort2 thread_pos [[ thread_position_in_threadgroup ]],
                             ushort tid [[ thread_index_in_threadgroup ]])
{
    threadgroup short blockIndices[288];
    float someValue = 0.0;
    // doing some work here which fills someValue...
    blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x] = someValue;

    // wait until all threads are done with their calculations
    threadgroup_barrier(mem_flags::mem_none);

    output += blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x]; // filling out output variable with threads calculations
}
The code above doesn't work. The output variable doesn't contain all the threads' calculations; it contains only the value from whichever thread was presumably last to add its value to output. To me it seems like threadgroup_barrier does absolutely nothing.
Now, the interesting part. The code below works:
blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x] = someValue;
threadgroup_barrier(mem_flags::mem_none); // wait when all threads are done with calculations
if (tid == 0) {
    for (int i = 0; i < 288; i++) {
        output += blockIndices[i]; // filling out output variable with threads calculations
    }
}
And this code also works just as well as the previous one:
blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x] = someValue;
if (tid == 0) {
    for (int i = 0; i < 288; i++) {
        output += blockIndices[i]; // filling out output variable with threads calculations
    }
}
To summarize: my code works as expected only when I handle the threadgroup memory from a single GPU thread - no matter which one; it can be the last thread in the threadgroup as well as the first. And the presence of threadgroup_barrier makes absolutely no difference; I also tried threadgroup_barrier with the mem_threadgroup flag, and the code still doesn't work.
I understand that I might be missing some very important detail, and I would be happy if someone could point out my errors. Thanks in advance!
When you write output += blockIndices[...], all threads try to perform this operation at the same time. Since output is not an atomic variable, this results in a race condition; it's not a thread-safe operation.
Your second solution is the correct one. You need a single thread to collect the results (although you could split this up across multiple threads too). That it still works OK when you remove the barrier may just be down to luck.
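For illustration, a minimal sketch of that single-collector pattern (THREAD_COUNT_X, tid and the 288-entry array are carried over from the question; note that mem_threadgroup, not mem_none, is the flag that makes threadgroup-memory writes visible to the other threads):

blockIndices[thread_pos.y * THREAD_COUNT_X + thread_pos.x] = someValue;
threadgroup_barrier(mem_flags::mem_threadgroup); // ensure every thread's write is visible
if (tid == 0) {
    float sum = 0.0f;
    for (int i = 0; i < 288; i++) {
        sum += blockIndices[i];
    }
    output = sum; // single writer, so no race on the device buffer
}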
I'm facing weird behaviour in my FreeRTOS code, especially when using vTaskDelayUntil() and vTaskDelay().
I'm trying to read an input pin connected to my PIR sensor. On the scope I can see that the PIR holds the pin at 3.3 V for at least one second.
The code below only reads my PIR input when I comment out the vTaskDelayUntil line. As soon as I activate that line, the PINC register always reads 0, even when I'm sure there is 3.3 V on the input pin.
static void TaskStatemachine(void *pvParameters)
{
    (void) pvParameters;

    TickType_t xLastWakeTime;
    const TickType_t xFrequency = 100;
    xLastWakeTime = xTaskGetTickCount();

    for(;;)
    {
        printf("PINC.1 = %d\n", (PINC & (1<<1)) );
        vTaskDelayUntil( &xLastWakeTime, ( xFrequency / portTICK_PERIOD_MS ) );
    }
}
What is happening here?
I changed xFrequency to different values, but without any luck.
As an experiment, simplify the output thus:
putchar( (PINC & (1<<1)) == 0 ? '0' : '1' );
You will then get a continuous stream of 1s or 0s.
If that works both with and without the delay, then it seems likely that the task has too small a stack to support printf(). Try increasing the stack and putting the printf() back in.
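Applied to the task from the question, the loop would become (same timing variables as above; just the lighter output call):

for(;;)
{
    putchar( (PINC & (1<<1)) == 0 ? '0' : '1' );
    vTaskDelayUntil( &xLastWakeTime, ( xFrequency / portTICK_PERIOD_MS ) );
}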
My team is trying to control the frequency of a Texas Instruments OMAP-L138. The default frequency is 300 MHz and we want to raise it to 372 MHz in a "complete" way: we would like not only to change the default value to the desired one (or at least configure it at startup), but also to be able to change the value at run time.
Searching the web for how to do this, we found an article saying that one way is via an "echo" command:
echo 372000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
We did some tests with this command and it runs fine, with one problem: sometimes the first call to it leads to a "Division by zero in kernel" error message.
In my personal tests, this error always appeared on the first call to the echo command; all later calls worked without error. If I then reset the processor and call the command again, the same thing happens: the first call produces the error and later calls work without problem.
So my questions are: what is causing this problem, and how can I solve it? (Obviously the answer "always type it twice" doesn't count!)
(Feel free to mention other ways of controlling the OMAP-L138's frequency at run time as well!)
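(Worth noting: scaling_setspeed is only honoured while the userspace cpufreq governor is selected, so the full sequence - assuming that governor is available in the kernel - would be:)

echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo 372000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed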
It looks to me like you have a division by zero in the davinci_spi_cpufreq_transition() function. Somewhere in this function (or in some function it calls) there is a buggy division operation which tries to divide by a variable that (in your case) has the value 0. This is obviously an error case which should be handled properly in the code, but in fact it isn't.
It's hard to tell exactly which code leads to this, because I don't know which kernel you are using; it would be much easier if you could provide a link to your kernel repository. Although I couldn't find davinci_spi_cpufreq_transition in the upstream kernel, I found it here.
The davinci_spi_cpufreq_transition() function appears to be in drivers/spi/davinci_spi.c. It calls the davinci_spi_calc_clk_div() function, which contains two division operations. The first is:
prescale = ((clk_rate / hz) - 1);
And the second is:
if (hz < (clk_rate / (prescale + 1)))
One of them is probably causing the "division by zero" error. I propose you trace which one it is by modifying davinci_spi_calc_clk_div() as follows (just add the lines marked with "+"):
static void davinci_spi_calc_clk_div(struct davinci_spi *davinci_spi)
{
    struct davinci_spi_platform_data *pdata;
    unsigned long clk_rate;
    u32 hz, cs_num, prescale;

    pdata = davinci_spi->pdata;
    cs_num = davinci_spi->cs_num;
    hz = davinci_spi->speed;
    clk_rate = clk_get_rate(davinci_spi->clk);
+   printk(KERN_ERR "### hz = %u\n", hz);
    prescale = ((clk_rate / hz) - 1);
    if (prescale > 0xff)
        prescale = 0xff;
+   printk(KERN_ERR "### prescale + 1 = %lu\n", prescale + 1UL);
    if (hz < (clk_rate / (prescale + 1)))
        prescale++;
    if (prescale < 2) {
        pr_info("davinci SPI controller min. prescale value is 2\n");
        prescale = 2;
    }

    clear_fmt_bits(davinci_spi->base, 0x0000ff00, cs_num);
    set_fmt_bits(davinci_spi->base, prescale << 8, cs_num);
}
My guess is that it's the "hz" variable which is 0 in your case. If so, you may also want to add the following debug line to the davinci_spi_setup_transfer() function:

if (!hz)
    hz = spi->max_speed_hz;
+ printk(KERN_ERR "### setup_transfer: setting speed to %u\n", hz);
davinci_spi->speed = hz;
davinci_spi->cs_num = spi->chip_select;

With all those modifications made, rebuild your kernel and you will probably get a clue as to why you get that "div by zero" error: just look for lines starting with "###" in your kernel boot log. If you don't know what to do next, post those debug lines and I will try to help you.
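(And once the zero divisor is confirmed - assuming it does turn out to be hz - the eventual fix is an ordinary guard before the division; purely as a sketch:)

if (hz == 0) {
    printk(KERN_WARNING "davinci_spi: transfer speed is 0, skipping clock divider update\n");
    return;
}
prescale = ((clk_rate / hz) - 1);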
Uint32 prev = SDL_GetTicks();

while ( true )
{
    Draw();

    Uint32 now = SDL_GetTicks();
    Uint32 delta = now - prev;
    printf( "%u\n" , delta );

    Update( delta / 1000.0f );
    prev = now;
    ProcessEvents();
}
The application is a simple moving square. My loop looks like the above, and with vsync on the whole thing runs quite smoothly; turning it off instead causes the animation to jump now and then. I've inserted some prints and here's what I've found:
[...]
16
15
16
66 #
2 #
0 #
0 #
16
16
21
[...]
I know there are several known issues with this kind of loop, but none of them seem to apply to this simple example (am I wrong?). What causes this behavior, and how can I overcome it?
I'm using an ATI card on a Linux system, but I'm hoping for a portable explanation/solution.
It seems that it was a lack of glFinish(). I've read somewhere that calls to that function are in most cases useless (here or here, for example), and maybe I'm misunderstanding some fundamental concepts, but it worked for me: presumably without it the driver just queues up the GL commands and returns immediately, so the CPU-side timestamps stop reflecting when frames actually finish rendering. Draw() now ends with:
    [...]
    glFinish();
    SDL_GL_SwapBuffers();
}