Simple Mac ProgressIndicator causes crash: "caught causing excessive wakeups" - multithreading

I have this Button click handler (MonoMac on OS X 10.9.3):
partial void OnDoButtonClick(NSObject sender)
{
    DoButton.Enabled = false;
    // Start animation
    ProgressIndicator.StartAnimation(this);
    ThreadPool.QueueUserWorkItem(_ => {
        // Perform a task that lasts for about a second:
        Thread.Sleep(1 * 1000);
        // Stop animation:
        InvokeOnMainThread(() => {
            ProgressIndicator.StopAnimation(this);
            DoButton.Enabled = true;
        });
    });
}
However, when I run the code by pressing the button, the main thread stops and the following error occurs:
(lldb) quit
* thread #1: tid = 0x2bf20, 0x98fd9f7a libsystem_kernel.dylib`mach_msg_trap + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
And, the following log is recorded in the system log:
2014/05/21 13:10:51.752 com.apple.debugserver-310.2[3553]: 1 +0.000001 sec [0de1/1503]: error: ::read ( 0, 0x107557a40, 1024 ) => -1 err = Connection reset by peer (0x00000036)
2014/05/21 13:10:51.752 com.apple.debugserver-310.2[3553]: 2 +0.000001 sec [0de1/0303]: error: ::ptrace (request = PT_THUPDATE, pid = 0x0ddc, tid = 0x1a03, signal = -1) err = Invalid argument (0x00000016)
2014/05/21 13:10:51.753 com.apple.debugserver-310.2[3553]: Exiting.
2014/05/21 13:11:05.000 kernel[0]: process <AppName>[3548] caught causing excessive wakeups. Observed wakeups rate (per sec): 1513; Maximum permitted wakeups rate (per sec): 150; Observation period: 300 seconds; Task lifetime number of wakeups: 45061
2014/05/21 13:11:05.302 ReportCrash[3555]: Invoking spindump for pid=3548 wakeups_rate=1513 duration=30 because of excessive wakeups
2014/05/21 13:11:07.452 spindump[3556]: Saved wakeups_resource.spin report for <AppName> version 1.2.1.0 (1) to /Library/Logs/DiagnosticReports/<AppName>_2014-05-21-131107_<UserName>-MacBook-Pro.wakeups_resource.spin
Extract from above: Maximum permitted wakeups rate (per sec): 150; Observation period: 300 seconds; Task lifetime number of wakeups: 45061
The problem does NOT happen if I remove the ProgressIndicator.StartAnimation(this); and ProgressIndicator.StopAnimation(this); lines.
Why is the main thread stopped by SIGSTOP?
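For what it's worth, a workaround sometimes suggested for excessive-wakeup reports from spinning indicators is to turn off the indicator's threaded animation, so the spinner is driven from the main run loop instead of a dedicated animation thread. This is only a sketch and assumes the MonoMac binding exposes NSProgressIndicator's usesThreadedAnimation property as UsesThreadedAnimation:
partial void OnDoButtonClick(NSObject sender)
{
    DoButton.Enabled = false;

    // Assumption: UsesThreadedAnimation maps to NSProgressIndicator's
    // usesThreadedAnimation; with it disabled, the animation is drawn
    // on the main run loop instead of a separate high-wakeup thread.
    ProgressIndicator.UsesThreadedAnimation = false;
    ProgressIndicator.StartAnimation(this);

    ThreadPool.QueueUserWorkItem(_ => {
        Thread.Sleep(1 * 1000); // stand-in for the real one-second task

        InvokeOnMainThread(() => {
            ProgressIndicator.StopAnimation(this);
            DoButton.Enabled = true;
        });
    });
}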

Related

MacOS Catalina freezing+crashing after running Node.JS load test script

I wrote a simple load-testing script that runs N hits against an HTTP endpoint over M async parallel lanes. Each lane waits for its previous request to finish before starting a new one. For my specific use case, the script randomly picks a numeric "width" parameter to add to the URL each time. The endpoint returns between 200k and 900k of image data on each request depending on the width parameter, but my script does not care about this data and simply relies on garbage collection to clean it up.
const fetch = require('node-fetch');

const MIN_WIDTH = 200;
const MAX_WIDTH = 1600;

const loadTestUrl = `
http://load-testing-server.com/endpoint?width={width}
`.trim();

async function fetchAll(url) {
    const res = await fetch(url, {
        method: 'GET'
    });
    if (!res.ok) {
        throw new Error(res.statusText);
    }
}

async function doSingleRun(runs, id) {
    const runStart = Date.now();
    console.log(`(id = ${id}) - Running ${runs} times...`);
    for (let i = 0; i < runs; i++) {
        const start = Date.now();
        const width = Math.floor(Math.random() * (MAX_WIDTH - MIN_WIDTH)) + MIN_WIDTH;
        try {
            const result = await fetchAll(loadTestUrl.replace('{width}', `${width}`));
            const duration = Date.now() - start;
            console.log(`(id = ${id}) - Width ${width} Success. ${i+1}/${runs}. Duration: ${duration}`)
        } catch (e) {
            const duration = Date.now() - start;
            console.log(`(id = ${id}) - Width ${width} Error fetching. ${i+1}/${runs}. Duration: ${duration}`, e)
        }
    }
    console.log(`(id = ${id}) - Finished run. Duration: ` + (Date.now() - runStart));
}

(async function () {
    const RUNS = 200;
    const parallelRuns = 10;
    const promises = [];
    const parallelRunStart = Date.now();
    console.log(`Running ${parallelRuns} parallel runs`)
    for (let i = 0; i < parallelRuns; i++) {
        promises.push(doSingleRun(RUNS, i))
    }
    await Promise.all(promises);
    console.log(`Finished parallel runs. Duration ${Date.now() - parallelRunStart}`)
})();
When I run this in Node 14.17.3 on my MacBook Pro running MacOS 10.15.7 (Catalina) with even a modest parallel lane number of 3, after about 120 (x 3) hits of the endpoint the following happens in succession:
1. Console output ceases in the terminal for the script, indicating the script has halted.
2. Other applications such as my browser are unable to make network connections.
3. Within 1-2 minutes other applications on my machine begin to slow down and eventually freeze up.
4. My entire system crashes with a kernel panic and the machine reboots.
panic(cpu 2 caller 0xffffff7f91ba1ad5): userspace watchdog timeout: remoted connection watchdog expired, no updates from remoted monitoring thread in 60 seconds, 30 checkins from thread since monitoring enabled 640 seconds ago after loadservice: com.apple.logd, total successful checkins since load (642 seconds ago): 64, last successful checkin: 10 seconds ago
service: com.apple.WindowServer, total successful checkins since load (610 seconds ago): 60, last successful checkin: 10 seconds ago
I can very easily stop the progression of these symptoms by hitting Ctrl+C in the script's terminal and force-quitting it. Everything quickly gets back to normal, and I can repeat the experiment multiple times before allowing it to crash my machine.
I've monitored Activity Monitor during the progression: there is very little (~1%) CPU usage and memory usage reaches maybe 60-70 MB, though it is pretty evident that network activity peaks during the script's run.
In my search for others with this problem there were only two Stack Overflow questions that came close:
node.js hangs other programs on my mac
Node script causes system freeze when uploading a lot of files
Anyone have any idea why this would happen? It seems very dangerous that a single app/script could so easily bring down a machine without being killed first by the OS.
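One thing that may be relevant here, though it is an assumption and not something established above: with node-fetch, the underlying connection is not released until the response body has been consumed, so a loop that only checks res.ok can leave a growing number of sockets and buffered data behind. A minimal variant of the script's fetchAll that drains the body while still discarding the data would look like this:
async function fetchAll(url) {
    const res = await fetch(url, { method: 'GET' });
    if (!res.ok) {
        throw new Error(res.statusText);
    }
    // Drain the body so the connection can be released/reused;
    // the bytes themselves are still thrown away.
    await res.arrayBuffer();
}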

ZIO scala sleep method not sleeping the thread vs. using directly Thread.sleep

In my existing Scala code I replaced Thread.sleep(10000) with ZIO.sleep(Duration.fromScala(10.seconds)), with the understanding that it won't block a thread from the thread pool (a performance issue). When the program runs it does not wait at this line (whereas of course in the first case it does). Do I need to add any extra code for the ZIO method to work?
Here is the relevant section from my Play + Scala code:
def sendMultipartEmail = Action.async(parse.multipartFormData) { request =>
  .....
  // inside this controller the method below is called
  def retryEmailOnFail(pList: ListBuffer[JsObject], content: String) = {
    if (!sendAndGetStatus(pList, content)) {
      println("<--- email sending failed - retry once after a delay")
      ZIO.sleep(Duration.fromScala(10.seconds))
      println("<--- retrying email sending after a delay")
      finalStatus = finalStatus && sendAndGetStatus(pList, content)
    } else {
      finalStatus = finalStatus && true
    }
  }
  .....
}
As you said, ZIO.sleep will only suspend the fiber that is running, not the operating system thread. Also note that ZIO.sleep merely returns a description of sleeping: in the snippet above its result is discarded, so it is never executed and the program does not pause at that line.
If you want to start something after sleeping, you should just chain it after the sleep:
// value 42 will only be computed after waiting for 10s
val io = ZIO.sleep(Duration.fromScala(10.seconds)).map(_ => 42)
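Applied to the snippet in the question, the retry branch could be folded into one chained effect. This is only a sketch: it assumes ZIO 1.x naming, keeps sendAndGetStatus as the existing side-effecting function, and relies on the caller actually running the returned effect (which the Play controller above does not show):
// Nothing here executes until the returned ZIO value is run by a runtime.
def retryEmailOnFail(pList: ListBuffer[JsObject], content: String) =
  ZIO.effectTotal(sendAndGetStatus(pList, content)).flatMap { ok =>
    if (ok) ZIO.succeed(true)
    else
      ZIO.effectTotal(println("<--- email sending failed - retry once after a delay")) *>
        ZIO.sleep(Duration.fromScala(10.seconds)) *>
        ZIO.effectTotal(sendAndGetStatus(pList, content))
  }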

Golang Memory Leak Concerning Goroutines

I have a Go program that runs continuously and relies entirely on goroutines + 1 manager thread. The main thread simply calls goroutines and otherwise sleeps.
There is a memory leak. The program uses more and more memory until it drains all 16GB RAM + 32GB SWAP and then each goroutine panics. It is actually OS memory that causes the panic, usually the panic is fork/exec ./anotherapp: cannot allocate memory when I try to execute anotherapp.
When this happens all of the worker threads will panic and be recovered and restarted. So each goroutine will panic, be recovered and restarted... at which point the memory usage will not decrease, it remains at 48GB even though there is now virtually nothing allocated. This means all goroutines will always panic as there is never enough memory, until the entire executable is killed and restarted completely.
The entire thing is about 50,000 lines, but the actual problematic area is as follows:
type queue struct {
    identifier string
    type       bool
}

func main() {
    // Set number of goroutines that can be run
    var xthreads int32 = 10
    var usedthreads int32
    runtime.GOMAXPROCS(14)
    ready := make(chan *queue, 5)

    // Start the manager goroutine, which prepares identifiers in the background ready for processing, always with 5 waiting to go
    go manager(ready)

    // Start creating goroutines to process as they are ready
    for obj := range ready { // loops through "ready" channel and waits when there is nothing
        // This section uses atomic instead of a blocking channel in an earlier attempt to stop the memory leak, but it didn't work
        for atomic.LoadInt32(&usedthreads) >= xthreads {
            time.Sleep(time.Second)
        }
        debug.FreeOSMemory()             // Try to clean up the memory, also did not stop the leak
        atomic.AddInt32(&usedthreads, 1) // Mark goroutine as started

        // Unleak obj, probably unnecessary, but just to be safe
        copy := new(queue)
        copy.identifier = unleak.String(obj.identifier) // unleak is a 3rd party package that makes a copy of the string
        copy.type = obj.type
        go runit(copy, &usedthreads) // Start the processing thread
    }
    fmt.Println(`END`) // This should never happen as the channels are never closed
}

func manager(ready chan *queue) {
    // This thread communicates with another server and fills the "ready" channel
}

// This is the goroutine
func runit(obj *queue, threadcount *int32) {
    defer func() {
        if r := recover(); r != nil {
            // Panicked
            erstring := fmt.Sprint(r)
            reportFatal(obj.identifier, erstring)
        } else {
            // Completed successfully
            reportDone(obj.identifier)
        }
        atomic.AddInt32(threadcount, -1) // Mark goroutine as finished
    }()
    do(obj) // This function does the actual processing
}
As far as I can see, when the do function (last line) ends, either by having finished or having panicked, the runit function then ends, which ends the goroutine entirely, which means all of the memory from that goroutine should now be free. This is not what happens. What happens is that the app just uses more and more memory until it becomes unable to function, all the runit goroutines panic, and yet the memory does not decrease.
Profiling does not reveal anything suspicious. The leak appears to be outside of the profiler's scope.
Please consider inverting the pattern, see here or below....
package main

import (
    "log"
    "math/rand"
    "sync"
    "time"
)

// I do work
func worker(id int, work chan int) {
    for i := range work {
        // Work simulation
        log.Printf("Worker %d, sleeping for %d seconds\n", id, i)
        time.Sleep(time.Duration(rand.Intn(i)) * time.Second)
    }
}

// Return some fake work
func getWork() int {
    return rand.Intn(2) + 1
}

func main() {
    wg := new(sync.WaitGroup)
    work := make(chan int)

    // run 10 workers
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func(i int) {
            worker(i, work)
            wg.Done()
        }(i)
    }

    // main "thread"
    for i := 0; i < 100; i++ {
        work <- getWork()
    }

    // signal there is no more work to be done
    close(work)

    // Wait for the workers to exit
    wg.Wait()
}
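The key difference from the code in the question is that the worker goroutines are created once and then reused for the lifetime of the program, so the number of live goroutines (and whatever each one allocates) stays bounded, instead of a new goroutine being spawned for every item pulled off the ready channel.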

Vxworks getting stuck in memory routines

I'm running VxWorks 6.3 and have run into a problem. I have a series of tasks running in an RTP. I create a task, do stuff, then destroy the task. Then I create two tasks, very close together, do some stuff, and destroy them. These tasks have to do crazy things like malloc and free memory. Unfortunately, if I do this enough times, one of the tasks will get stuck in the memory routines (both malloc and free) on a semaphore. It's always the second task that gets "lost" at the very start of the task, in either free or malloc. After the failure, I can still create tasks and I can still malloc memory. The failing task sits forever, waiting for the semaphore... a semaphore that other tasks MUST be using.
Does anyone have any idea how a task can get stuck in the memory routines?
0x08265e58 malloc +0x2c : 0x082416f4 ()
0x08267e50 memPartAlloc +0x28 : 0x08241734 ()
0x08267e0c memPartAlignedAlloc+0x70 : 0x08267c04 ()
0x08267c7c memPartFree +0xfc : 0x08240654 ()
0x082753c0 semTake +0x90 : 0x08242534 ()
0x082752ec semUMTake +0xd8 : 0x08242514 ()
---- system call boundary ----
-> tw 0x69d21b0
NAME ENTRY TID STATUS DELAY OBJ_TYPE OBJ_ID OBJ_NAME
---------- ---------- ---------- ---------- ----- ---------- ---------- --------
tHttp631-2 0x827dbfc 0x69d21b0 PEND 0 SEM_M 0x6859650 N/A
Semaphore Id : 0x6859650
Semaphore Type : MUTEX
Task Queuing : PRIORITY
Pended Tasks : 1
Owner : 0x69d1a08 Deleted!
Options : 0xd SEM_Q_PRIORITY
SEM_DELETE_SAFE
SEM_INVERSION_SAFE
VxWorks Events
--------------
Registered Task : NONE
Event(s) to Send : N/A
Options : N/A
Pended Tasks
------------
NAME TID PRI TIMEOUT
---------- -------- --- -------
tHttp631-25502 69d21b0 120 0
value = 0 = 0x0
->
It is recommended that you allocate enough memory for the worst case at init time and then just re-use that memory throughout the duration of your program, especially if you actually have real-time requirements, since malloc/free are non-deterministic operations. I also recommend re-using tasks rather than creating new tasks at runtime, and then using a semaphore or message queue to kick off the appropriate tasks at the appropriate times. So your program flow might look something like this:
initTime()
{
    t1mem = malloc(t1memSize);
    t2mem = malloc(t2memSize);
    t3mem = malloc(t3memSize);
    t1q = msgQCreate(qlen, msglen, MSG_Q_FIFO);
    t2q = msgQCreate(qlen, msglen, MSG_Q_FIFO);
    t3q = msgQCreate(qlen, msglen, MSG_Q_FIFO);
    rspq = msgQCreate(qlen, msglen, MSG_Q_FIFO);
    taskSpawn("t1", t1pri, ..., t1Entry, t1mem, t1q, rspq, ...);
    taskSpawn("t2", t2pri, ..., t2Entry, t2mem, t2q, rspq, ...);
    taskSpawn("t3", t3pri, ..., t3Entry, t3mem, t3q, rspq, ...);

    runTime(t1q, t2q, t3q, rspq);

    msgQDelete(t1q);
    msgQDelete(t2q);
    msgQDelete(t3q);
    msgQDelete(rspq);
    free(t1mem);
    free(t2mem);
    free(t3mem);
}

runTime(MSG_Q_ID t1q, MSG_Q_ID t2q, MSG_Q_ID t3q, MSG_Q_ID rspq)
{
    while (programRun)
    {
        tasksDone = 0;
        msgQSend(t1q, t1start, msglen, 100, MSG_PRI_NORMAL);
        if (msgQReceive(rspq, buf, msglen, errorCaseTimeout) == OK)
        {
            // check to make sure the msg is t1done...
            // report error if it isn't...
            msgQSend(t2q, t2start, msglen, 100, MSG_PRI_NORMAL);
            msgQSend(t3q, t3start, msglen, 100, MSG_PRI_NORMAL);
            for (int x = 0; x < 2; x++)
            {
                if (msgQReceive(rspq, buf, msglen, errorCaseTimeout) == OK)
                {
                    // check to make sure the msg is t2done/t3done...
                    // report error if it isn't...
                    tasksDone++;
                }
            }
        }
        if (tasksDone == 2)
        {
            // everything is good... keep on running...
        }
        else
        {
            // a task didn't finish within the errorCaseTimeout time...
            // report error or something, maybe set programRun to false...
        }
    }
}

t1Entry(void* mem, MSG_Q_ID q, MSG_Q_ID rspq)
{
    while (programRun)
    {
        if (msgQReceive(q, buf, msglen, 100) == OK)
        {
            doTask1(mem);
            msgQSend(rspq, t1done, msglen, 100, MSG_PRI_NORMAL);
        }
    }
}

t2Entry(void* mem, MSG_Q_ID q, MSG_Q_ID rspq)
{
    while (programRun)
    {
        if (msgQReceive(q, buf, msglen, 100) == OK)
        {
            doTask2(mem);
            msgQSend(rspq, t2done, msglen, 100, MSG_PRI_NORMAL);
        }
    }
}

t3Entry(void* mem, MSG_Q_ID q, MSG_Q_ID rspq)
{
    while (programRun)
    {
        if (msgQReceive(q, buf, msglen, 100) == OK)
        {
            doTask3(mem);
            msgQSend(rspq, t3done, msglen, 100, MSG_PRI_NORMAL);
        }
    }
}
Obviously the above code is not very DRY, and not all error cases are fully handled, but it is a start and has a good chance of working deterministically.
A few questions:
Are all the tasks created/deleted in an RTP?
How are you "destroying the task"?
When the task blocks, are the new malloc/task creations in the same RTP or a different RTP?
Are you deleting entire RTPs?
It sounds like you are using a taskDelete from one task to destroy those other tasks. If that's the case, then it is possible that a task is being deleted in the middle of a memory operation.
Since this is a malloc operation in an RTP, each RTP created contains its own heap (malloc) semaphore. I would think this would be the semaphore being held.
I would suggest contacting Wind River support. This might be an issue they are familiar with.
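If the hang really is a task being deleted while it holds the heap mutex, one possible direction (just a sketch, and it assumes you control the worker tasks' entry points) is to stop deleting workers from the outside and let them exit on their own, so they can never die inside malloc/free. The names shutdownRequested, workerEntry, doWork and WORK_BUF_SIZE below are placeholders, not from the original post:
/* Cooperative shutdown instead of taskDelete(): the worker only ever
 * exits at a point where it does not hold the heap semaphore. */
volatile BOOL shutdownRequested = FALSE;

void workerEntry(void)
{
    while (!shutdownRequested)
    {
        void *buf = malloc(WORK_BUF_SIZE);
        if (buf != NULL)
        {
            doWork(buf);
            free(buf);
        }
    }
    /* Returning (or calling taskExit()) here is safe: the task is
     * outside of any memory routine when it goes away. */
}

void stopWorker(void)
{
    shutdownRequested = TRUE;   /* instead of taskDelete(workerTid) */
}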
This may be related to a problem I am having, although I am seeing a different symptom.
In both cases the owner of a semaphore is being deleted. In my case, the tWebTask hangs, and I traced that to a missing semaphore owner on a web socket.
Here's the link to my SO question.

Example of dynamic thread pool in boost::asio

I'm going to implement a boost::asio server with a thread pool using a single io_service (as in the HTTP Server 3 example). The io_service will be bound to a Unix domain socket and will pass requests coming from connections on this socket to different threads. In order to reduce resource consumption I want to make the thread pool dynamic.
Here is the concept. At first a single thread is created. When a request arrives and the server sees that there is no idle thread in the pool, it creates a new thread and passes the request to it. The server can create up to some maximum number of threads. Ideally, it should also be able to suspend threads which have been idle for some time.
Has anybody made something similar? Or maybe somebody has a relevant example?
As for me, I guess I should somehow override io_service.dispatch to achieve that.
There may be a few challenges with the initial approach:
boost::asio::io_service is not intended to be derived from or reimplemented. Note the lack of virtual functions.
If your thread library does not provide the ability to query a thread's state, then state information needs to be managed separately.
An alternative solution is to post a job into the io_service, then check how long it sat in the io_service. If the time delta between when it was ready-to-run and when it was actually run is above a certain threshold, then this indicates there are more jobs in the queue than threads servicing the queue. A major benefit to this is that the dynamic thread pool growth logic becomes decoupled from other logic.
Here is an example that accomplishes this by using the deadline_timer.
Set deadline_timer to expire 3 seconds from now.
Asynchronously wait on the deadline_timer. The handler will be ready-to-run 3 seconds from when the deadline_timer was set.
In the asynchronous handler, check the current time relative to when the timer was set to expire. If it is greater than 2 seconds, then the io_service queue is backing up, so add a thread to the thread pool.
Example:
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/noncopyable.hpp>
#include <boost/thread.hpp>
#include <iostream>

class thread_pool_checker
    : private boost::noncopyable
{
public:
    thread_pool_checker( boost::asio::io_service& io_service,
                         boost::thread_group& threads,
                         unsigned int max_threads,
                         long threshold_seconds,
                         long periodic_seconds )
        : io_service_( io_service ),
          timer_( io_service ),
          threads_( threads ),
          max_threads_( max_threads ),
          threshold_seconds_( threshold_seconds ),
          periodic_seconds_( periodic_seconds )
    {
        schedule_check();
    }

private:
    void schedule_check();
    void on_check( const boost::system::error_code& error );

private:
    boost::asio::io_service& io_service_;
    boost::asio::deadline_timer timer_;
    boost::thread_group& threads_;
    unsigned int max_threads_;
    long threshold_seconds_;
    long periodic_seconds_;
};

void thread_pool_checker::schedule_check()
{
    // Thread pool is already at max size.
    if ( max_threads_ <= threads_.size() )
    {
        std::cout << "Thread pool has reached its max. Example will shutdown."
                  << std::endl;
        io_service_.stop();
        return;
    }

    // Schedule check to see if pool needs to increase.
    std::cout << "Will check if pool needs to increase in "
              << periodic_seconds_ << " seconds." << std::endl;
    timer_.expires_from_now( boost::posix_time::seconds( periodic_seconds_ ) );
    timer_.async_wait(
        boost::bind( &thread_pool_checker::on_check, this,
                     boost::asio::placeholders::error ) );
}

void thread_pool_checker::on_check( const boost::system::error_code& error )
{
    // On error, return early.
    if ( error ) return;

    // Check how long this job was waiting in the service queue. This
    // returns the expiration time relative to now. Thus, if it expired
    // 7 seconds ago, then the delta time is -7 seconds.
    boost::posix_time::time_duration delta = timer_.expires_from_now();
    long wait_in_seconds = -delta.seconds();

    // If the time delta is greater than the threshold, then the job
    // remained in the service queue for too long, so increase the
    // thread pool.
    std::cout << "Job sat in queue for "
              << wait_in_seconds << " seconds." << std::endl;
    if ( threshold_seconds_ < wait_in_seconds )
    {
        std::cout << "Increasing thread pool." << std::endl;
        threads_.create_thread(
            boost::bind( &boost::asio::io_service::run,
                         &io_service_ ) );
    }

    // Otherwise, schedule another pool check.
    schedule_check();
}

// Busy work functions.
void busy_work( boost::asio::io_service&,
                unsigned int );

void add_busy_work( boost::asio::io_service& io_service,
                    unsigned int count )
{
    io_service.post(
        boost::bind( busy_work,
                     boost::ref( io_service ),
                     count ) );
}

void busy_work( boost::asio::io_service& io_service,
                unsigned int count )
{
    boost::this_thread::sleep( boost::posix_time::seconds( 5 ) );

    count += 1;

    // When the count is 3, spawn additional busy work.
    if ( 3 == count )
    {
        add_busy_work( io_service, 0 );
    }

    add_busy_work( io_service, count );
}

int main()
{
    using boost::asio::ip::tcp;

    // Create io service.
    boost::asio::io_service io_service;

    // Add some busy work to the service.
    add_busy_work( io_service, 0 );

    // Create thread group and thread_pool_checker.
    boost::thread_group threads;
    thread_pool_checker checker( io_service, threads,
                                 3,   // Max pool size.
                                 2,   // Create thread if job waits for 2 sec.
                                 3 ); // Check if pool needs to grow every 3 sec.

    // Start running the io service.
    io_service.run();

    threads.join_all();

    return 0;
}
Output:
Will check if pool needs to increase in 3 seconds.
Job sat in queue for 7 seconds.
Increasing thread pool.
Will check if pool needs to increase in 3 seconds.
Job sat in queue for 0 seconds.
Will check if pool needs to increase in 3 seconds.
Job sat in queue for 4 seconds.
Increasing thread pool.
Will check if pool needs to increase in 3 seconds.
Job sat in queue for 0 seconds.
Will check if pool needs to increase in 3 seconds.
Job sat in queue for 0 seconds.
Will check if pool needs to increase in 3 seconds.
Job sat in queue for 0 seconds.
Will check if pool needs to increase in 3 seconds.
Job sat in queue for 3 seconds.
Increasing thread pool.
Thread pool has reached its max. Example will shutdown.
