IOCP multithreaded server and reference counted class - multithreading

I am working on an IOCP server (overlapped I/O, 4 threads, CreateIoCompletionPort, GetQueuedCompletionStatus, WSASend, etc.). My goal is to send a single reference-counted buffer to all connected sockets. (I followed Len Holgate's suggestion from the post "WSAsend to all connected socket in multithreaded iocp server".) After the buffer has been sent to all connected clients, it should be deleted.
This is the class that holds the buffer to be sent:
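class refbuf
{
    // constructor and destructor are not shown in the original post
    int m_nLength;
    int m_wsk;
    char *m_pnData;          // buffer to send
    mutable int mRefCount;   // plain int, so ++/-- are not atomic
public:
    void grab() const { mRefCount++; }
    void release() const
    {
        if(mRefCount > 0) mRefCount--;
        if(mRefCount == 0) {delete (refbuf *)this;}
    }
    char* bufadr() { return m_pnData;}
};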
Sending the buffer to all sockets:
refbuf *refb = new refbuf(4);
pTmp1 = g_pCtxtList; // start of linked list with sockets
while( pTmp1 )
{
    pTmp2 = pTmp1->pCtxtBack;
    ovl=TakeOvl(); // ovl - struct containing WSAOVERLAPPED
    ovl->wsabuf.buf=refb->bufadr(); // address of m_pnData from refbuf
    ovl->rcb=refb; // when GQCS gets the notification, rcb is used to decrease mRefCount
    refb->grab(); // mRefCount++
    WSASend(pTmp1->Socket, &(ovl->wsabuf),1,&dwSendNumBytes,0,&(ovl->Overlapped), NULL);
    pTmp1 = pTmp2;
}
And in each of the 4 worker threads:
GetQueuedCompletionStatus(hIOCP, &dwIoSize,(PDWORD_PTR)&lpPerSocketContext, (LPOVERLAPPED *)&lpOverlapped, INFINITE);
lpIOContext = (PPER_IO_CONTEXT)lpOverlapped;
lpIOContext->rcb->release(); //mRefCount --,if mRefCount reach 0, delete object
I checked this with 5 connected clients and it seems to work. Once GQCS has received all the notifications, mRefCount reaches 0 and delete is executed.
My questions: is this approach appropriate? What if there are, for example, 100 or more clients? Does it avoid the situation where one thread deletes the object while another is still using it? How do I implement an atomic reference count in this scenario? Thanks in advance.

Obvious issues; in order of importance...
Your refbuf class doesn't use thread safe ref count manipulation. Use InterlockedIncrement() etc.
I assume that TakeOvl() obtains a new OVERLAPPED and WSABUF structure per operation.
Your naming could be better, why grab() rather than AddRef(), what does TakeOvl() take from? Those Tmp variables are something and the least important something is that they're 'temporary' so name them after a more important something. Go Read Code Complete.
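To make the first point concrete, here is a minimal sketch of a thread-safe reference count. It uses std::atomic rather than the Win32 InterlockedIncrement()/InterlockedDecrement() calls named above, purely so that it is portable; the class name RefBuf, its members, and the choice to start the count at 1 are assumptions for illustration, not the asker's actual class.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the question's refbuf, with an atomic count.
class RefBuf {
public:
    // The creator starts out holding one reference of its own.
    RefBuf(const char* data, std::size_t length)
        : data_(new char[length]), length_(length), refs_(1) {
        assert(data != nullptr);
        std::memcpy(data_, data, length);
    }

    // Take one reference per pending send, before the send is issued.
    void AddRef() const { refs_.fetch_add(1, std::memory_order_relaxed); }

    // Returns true if this call dropped the last reference and freed the object.
    bool Release() const {
        // acq_rel: work done through other references happens-before the delete.
        if (refs_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete this;
            return true;
        }
        return false;
    }

    char* data() { return data_; }
    std::size_t length() const { return length_; }

private:
    ~RefBuf() { delete[] data_; }  // private: only Release() may destroy
    RefBuf(const RefBuf&) = delete;
    RefBuf& operator=(const RefBuf&) = delete;

    char* data_;
    std::size_t length_;
    mutable std::atomic<int> refs_;
};
```

With this shape, the sending loop would call AddRef() before each WSASend and call Release() once itself after the loop, so the count cannot reach zero while sends are still being posted, even if the first completions arrive early. Each completion handler then calls Release() exactly once.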


What's the exact Linux equivalent of Windows' WaitOnAddress()?

Using shared memory with the shmget() system call, the aim of my C++ program is to fetch a bid price from the Internet through a server written in Rust, so that each time the value changes, I perform a financial transaction.
Server pseudocode
Shared_struct.price = new_price
Client pseudocode
Wait until memory address pointed by Shared_struct.price changes.
Goto Infinite_loop
Since launching a transaction involves paying transaction fees, I want to create a transaction only once per buy price change.
Using a semaphore or a futex, I can do the reverse, that is, wait for a variable to reach a specific value. But how do I wait until a variable is no longer equal to its current value?
Whereas on Windows I can do something like this on the address of the shared segment:
ULONG g_TargetValue; // global, accessible to all process
ULONG CapturedValue;
ULONG UndesiredValue;
UndesiredValue = 0;
CapturedValue = g_TargetValue;
while (CapturedValue == UndesiredValue) {
    WaitOnAddress(&g_TargetValue, &UndesiredValue, sizeof(ULONG), INFINITE);
    CapturedValue = g_TargetValue;
}
Is there a way to do this on Linux? Or a straight equivalent?
You can use a futex. (I assume "var" lives in the shared memory segment.)
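/* futex() stands for a thin wrapper around syscall(SYS_futex, ...); glibc does not export one. */
/* Client */
int prv;
while (1) {
    prv = var;
    int ret = futex(&var, FUTEX_WAIT, prv, NULL, NULL, 0);
    /* Spurious wake-up: the value has not changed, so wait again */
    if (!ret && var == prv) continue;
    /* otherwise var differs from prv: handle the new value here */
}
/* Server */
int prv = NOT_CACHED;
while (1) {
    var = updateVar();
    /* wake a waiter only when the value actually changed */
    if (var != prv || prv == NOT_CACHED)
        futex(&var, FUTEX_WAKE, 1, NULL, NULL, 0);
    prv = var;
}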
It requires the server side to call futex as well to notify client(s).
Note that the same holds true for WaitOnAddress.
According to MSDN:
Any thread within the same process that changes the value at the address on which threads are waiting should call WakeByAddressSingle to wake a single waiting thread or WakeByAddressAll to wake all waiting threads.
A higher-level synchronization method for this problem is a condition variable.
On Linux it is also implemented on top of futex.
See link
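As a rough illustration of the condition-variable approach, here is a minimal in-process sketch using std::condition_variable. The PriceCell class and its method names are hypothetical. For the cross-process case in the question, the mutex and condition variable would instead have to live in the shared segment and be created as process-shared (for example pthread primitives with the PTHREAD_PROCESS_SHARED attribute), which this sketch does not show.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical in-process price cell: waiters block until the value differs
// from the one they last saw.
class PriceCell {
public:
    explicit PriceCell(int initial) : value_(initial) {}

    // Server side: store the new price and wake waiters only on a real change.
    void publish(int new_price) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (new_price == value_) return;
            value_ = new_price;
        }
        changed_.notify_all();
    }

    // Client side: block until the price is no longer `last_seen`, then return it.
    // The predicate form of wait() also absorbs spurious wake-ups.
    int wait_for_change(int last_seen) {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [&] { return value_ != last_seen; });
        return value_;
    }

private:
    std::mutex mutex_;
    std::condition_variable changed_;
    int value_;
};
```

A client would loop on `last = cell.wait_for_change(last)` and run one transaction per return. Note that a price which changes and then changes back before the client wakes up is not observed; if every change matters, waiting on a sequence counter instead of the price itself is one possible variation.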

Boost ASIO as an event loop with boost lockfree queue for socket write

I am using Boost.Asio for a TCP client. For the most part, Asio is a glorified event loop for reads and writes. There is actually only one client managed by Asio.
The architecture is like this:
The TCP server streams continuous messages. The client reads the messages, processes them, and acks back with the proper code.
My code runs on the client side. There is one thread running io_service. The io_service thread reads messages and distributes them to N worker threads using a Boost lock-free SPSC queue. After processing, the workers post the replies to the io_service thread.
The most important concern for me is the rate of reads and writes, so I am using synchronous reads and writes.
Read Code:
void read ()
if (_connected && !_readInProgress) {
[self = shared_from_this(), this] (ErrorType err, unsigned a)
_readInProgress = false;
if (err) disconnect();
else asyncRead();
_readInProgress = true;
Basically I use read_some with null_buffers() and then directly use Unix system calls to read the messages. Each read gives N messages, which are enqueued to the worker threads in a loop.
I want to use the Boost SPSC queue in the reverse direction for writes to the socket from the workers.
// Get the queue to post writes
auto getWriteQ ()
{
    static thread_local auto q =
        std::make_shared< LFQType >(_epoch);
    return q;
}
So each thread gets a thread-local queue using getWriteQ. The writes to the queue look like this:
void write (Buf& buf) override
auto q = getWriteQ();
while (!q->enqueue(buf) && _connected);
if (!_connected) return; [self = shared_from_this(), this, q]()
writeHelper(q); });
Now this is inefficient, as we do an io_service post for each write. The write handler actually writes up to 32 messages at a time in a single system call using sendmmsg().
So I am looking for help with 2 things:
Is the design any good?
Is there any foolproof way to minimize the number of posts? I was thinking of keeping an atomic enqueue count.
The writing (worker) thread does this (pseudocode):
bool post = false;
if(enqueue_count == 0) post = true
// enqueue the message
// post the queue event
The io-service thread does this -
enqueue_count -= num_processed;
if (enqueue_count)
// repost the queue for further processing
Would this work if the enqueue_count is atomic ?
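For reference, one possible shape of the counter scheme described above, written as a sketch under stated assumptions rather than as a verified answer: one counter per queue, a single producer and a single consumer per queue, and the counter bumped only after the message is already in the queue. The class PostCoalescer and its method names are hypothetical. The main difference from the pseudocode is that the "is it zero?" check and the increment are a single fetch_add, so there is no window between reading the count and updating it.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper that tells each side when a post is actually needed.
class PostCoalescer {
public:
    // Producer: call AFTER the message is in the queue.
    // Returns true if the caller must post a drain event to the io thread.
    bool on_enqueued() {
        return pending_.fetch_add(1, std::memory_order_acq_rel) == 0;
    }

    // Consumer: call after draining n messages.
    // Returns true if more are pending and the handler should repost itself.
    bool on_processed(long n) {
        assert(n >= 0);
        return pending_.fetch_sub(n, std::memory_order_acq_rel) - n > 0;
    }

private:
    // Signed on purpose: the consumer may dequeue a message before the producer
    // has counted it, which makes the count dip below zero for a moment.
    std::atomic<long> pending_{0};
};
```

In this sketch a drain event is outstanding whenever the count is positive: the producer posts only on the 0 to 1 transition, and the consumer reposts itself only while work remains, so a burst of writes results in one post rather than one per message.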

Dynamic pool of processes

I'm writing a client-server (TCP) program in C on a Unix system. The client sends some information and the server answers. There's only one connection per child process. New connections use pre-running processes from a pool, and the pool size is dynamic, so if the number of free processes (processes not servicing a client) drops too low, it should create new processes, and likewise if it gets too high extra processes should be terminated.
This is my server code. Every connection creates a new child process using fork(), so each connection runs in its own process. How can I make a dynamic pool like the one I described above?
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(int argc, char * argv[])
{
    int cfd;
    int listener = socket(AF_INET, SOCK_STREAM, 0); //create listener socket
    if(listener < 0){
        perror("socket error");
        return 1;
    }
    struct sockaddr_in addr;
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PORT);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    int binding = bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    if(binding < 0){
        perror("binding error");
        return 1;
    }
    listen(listener, 1); //listen for new clients
    int pid;
    for(;;) // infinite loop on server
    {
        cfd = accept(listener, NULL, NULL); //client socket descriptor
        pid = fork(); //make child proc
        if(pid == 0) //in child proc...
        {
            close(listener); //close listener socket descriptor
            ... //some server actions that I do.(receive or send)
            close(cfd); // close client fd
            return 0;
        }
        close(cfd); // parent: drop its own copy of the client fd
    }
}
If you have several processes blocked in accept on the same listen socket, then a new connection that comes in will get delivered to one of them. (Depending, several may wake up, but only one will actually get the connection). So you need to fork several children after listen, but before accept. After handling a request, the child goes back to accept instead of exit. That handles (1) and (2).
(3) is harder. You need some form of IPC. Typically, you'd have a parent process that just manages having the right number of children. Your child processes need to use IPC to tell the parent how busy they are. The parent can then either fork more children (which go into the accept loop above) or send signals to children to tell them to finish up and exit. It should also handle waiting on children, handle unexpected deaths, etc.
The IPC you want to use is probably shared memory. Your two options are SysV (shmget) and POSIX (shm_open) shared memory. You probably want the latter if available. You'll have to deal with synchronizing access (both POSIX and SysV provide semaphores to help with this, again prefer POSIX) or using atomic access only.
(You probably don't actually want a process to exit the instant there are more than X free children, that'll lead to repeatedly reaping and spawning them, which is expensive. Instead you probably want some measure of how utilized they were over the last second... So your data is more complicated than a bitmap of in use/free.)
There are a lot of daemons that work like this, so you can fairly easily find code examples. Of course, if you go look at Apache, you'll probably find it more complicated, to get good performance and be portable everywhere.
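The "fork after listen, before accept" part can be sketched as follows. This is a minimal fixed-size pool only: it does not implement the dynamic resizing or the IPC discussed above. All function names (make_listener, worker_loop, prefork, stop_workers, request_once) and the "ok" reply are hypothetical, chosen just for the illustration.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Creates a listening TCP socket on loopback. Passing *port == 0 lets the
// kernel pick a port; the chosen port is written back. Returns -1 on failure.
int make_listener(std::uint16_t* port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(*port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    socklen_t len = sizeof(addr);
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), len) < 0 ||
        listen(fd, 16) < 0 ||
        getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

// Child body: serve one client, then go back to accept() instead of exiting.
[[noreturn]] void worker_loop(int listener) {
    for (;;) {
        int cfd = accept(listener, nullptr, nullptr);
        if (cfd < 0) continue;  // e.g. interrupted by a signal
        const char reply[] = "ok\n";
        ssize_t written = write(cfd, reply, sizeof(reply) - 1);
        (void)written;
        close(cfd);
    }
}

// Forks `count` workers; all of them block in accept() on the same socket.
std::vector<pid_t> prefork(int listener, int count) {
    std::vector<pid_t> children;
    for (int i = 0; i < count; ++i) {
        pid_t pid = fork();
        if (pid == 0) worker_loop(listener);  // child never returns
        if (pid > 0) children.push_back(pid);
    }
    return children;
}

// Parent-side cleanup: ask each worker to exit, then reap it.
void stop_workers(const std::vector<pid_t>& children) {
    for (pid_t pid : children) {
        kill(pid, SIGTERM);
        waitpid(pid, nullptr, 0);
    }
}

// Demo client: connect once and return whatever a worker sends back.
std::string request_once(std::uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return "";
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    std::string reply;
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0) {
        char buf[16];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            reply.append(buf, static_cast<std::size_t>(n));
    }
    close(fd);
    return reply;
}
```

A managing parent would call prefork() once at startup and then, based on whatever load information the children report, either fork additional workers into worker_loop() or signal some of them to exit.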

There's no sleep()/wait for mutex in node.js, so how to deal with large IO tasks?

I have a large array of filenames I need to check, but I also need to respond to network clients. The easiest way is to perform:
for(var i=0;i < array.length;i++) {
    fs.readFile(array[i], function(err, data) {...});
}
But the array can be of any length, say 100000, so it's not a good idea to start 100000 reads at once. On the other hand, using fs.readFileSync() can take too long. Launching the next fs.readFile() in the callback, like this:
var Idx = 0;
function checkFile() {
    fs.readFile(array[Idx], function (err, data) {
        Idx++;
        if (Idx < array.length) {
            checkFile();
        } else {
            Idx = 0;
            setTimeout(checkFile, 10000); // start checking files again in ten seconds
        }
    });
}
is also not the best option, because array[] is constantly updated by network clients: some items are deleted, new ones are added, and so on.
What is the best way to accomplish such a task in node.js?
You should stick to your first solution (fs.readFile). For file I/O, node.js uses a thread pool. The reason is that most unix kernels don't provide efficient asynchronous APIs for the file system. Even if you start 10,000 reads concurrently, only a few reads will actually run and the rest will wait in a queue.
In order to make this answer more interesting, I browsed through node's code again to make sure that things hadn't changed.
Long story short, file I/O uses blocking system calls and is performed by a thread pool with at most 4 concurrent threads.
The important code is in libeio, which is abstracted by libuv. All I/O code is wrapped by macros which queue requests. For example:
eio_req *eio_read (int fd, void *buf, size_t length, off_t offset, int pri, eio_cb cb, void *data, eio_channel *channel)
{
  REQ (EIO_READ); req->int1 = fd; req->offs = offset; req->size = length; req->ptr2 = buf; SEND;
}
REQ prepares the request and SEND queues it. We eventually end up in etp_maybe_start_thread:
static unsigned int started, idle, wanted = 4;
static void
etp_maybe_start_thread (void)
if (ecb_expect_true (etp_nthreads () >= wanted))
The queue keeps 4 threads running to process the requests. When our read request is finally executed, eio simply uses the blocking read from unistd.h:
case EIO_READ: ALLOC (req->size);
req->result = req->offs >= 0
? pread (req->int1, req->ptr2, req->size, req->offs)
: read (req->int1, req->ptr2, req->size); break;
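The queue-plus-fixed-workers behaviour described above can be illustrated with a small sketch. This is not libeio's code; RequestPool and its members are hypothetical names, and the only thing it is meant to show is that submitting thousands of requests does not run thousands of blocking calls at once.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool: requests queue up, at most `wanted` run at once.
class RequestPool {
public:
    explicit RequestPool(std::size_t wanted = 4) {
        for (std::size_t i = 0; i < wanted; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Lets the workers drain the queue, then joins them.
    ~RequestPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_all();
        for (std::thread& t : workers_) t.join();
    }

    // Comparable in spirit to SEND above: queue the request and return at once.
    void submit(std::function<void()> request) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(request));
        }
        ready_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> request;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stopping and nothing left to do
                request = std::move(queue_.front());
                queue_.pop();
            }
            request();  // a blocking call such as read()/pread() would run here
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> queue_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

Submitting 100000 requests to such a pool costs memory for the queued entries, but only `wanted` of them are ever inside a blocking call at the same moment; the rest simply wait their turn.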

How to asynchronously read to std::string using Boost::asio?

I'm learning Boost.Asio and all that async stuff. How can I asynchronously read into a variable user_ of type std::string? boost::asio::buffer(user_) works only with async_write(), but not with async_read(). It works with vector, so why doesn't it work with string? Is there another way to do that besides declaring char user_[max_len] and using boost::asio::buffer(user_, max_len)?
Also, what's the point of inheriting from boost::enable_shared_from_this<Connection> and using shared_from_this() instead of this in async_read() and async_write()? I've seen that a lot in the examples.
Here is a part of my code:
class Connection
Connection(tcp::acceptor &acceptor) :
socket_(acceptor.get_io_service(), tcp::v4())
{ }
void start()
boost::bind(&Connection::start_accept, this));
void start_accept()
boost::bind(&Connection::handle_accept, this,
void handle_accept(const boost::system::error_code& err)
if (err)
async_read(socket_, boost::asio::buffer(user_),
boost::bind(&Connection::handle_user_read, this,
placeholders::error, placeholders::bytes_transferred));
void handle_user_read(const boost::system::error_code& err,
std::size_t bytes_transferred)
if (err)
void disconnect()
tcp::acceptor &acceptor_;
tcp::socket socket_;
std::string user_;
std::string pass_;
The Boost.Asio documentation states:
A buffer object represents a contiguous region of memory as a 2-tuple consisting of a pointer and size in bytes. A tuple of the form {void*, size_t} specifies a mutable (modifiable) region of memory.
This means that in order for a call to async_read to write data to a buffer, it must be (in the underlying buffer object) a contiguous block of memory. Additionally, the buffer object must be able to write to that block of memory.
std::string has historically exposed only read-only access to its storage: data() returns a const pointer that is guaranteed to stay valid until the next call to a non-const member function. For this reason, Asio can easily create a const_buffer wrapping a std::string, and you can use it with async_write, but that same buffer cannot be handed to async_read, which needs a mutable one. (Since C++11 the characters of a std::string are guaranteed to be contiguous, so writing through &user_[0] is possible, provided the string has already been resized to hold the incoming data; an empty string only gives a zero-length buffer.)
The Asio documentation has example code for a simple "chat" program that shows a good method of overcoming this problem. Basically, the sending side transmits the size of a message first, in a "header" of sorts, and your read handler interprets the header to allocate a buffer of a fixed size suitable for reading the actual data.
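To illustrate the header idea without depending on Boost, here is a minimal sketch of length-prefixed framing. The 4-byte big-endian header and the names encode_frame, decode_header and try_extract are assumptions made for this example; the chat sample in the Asio documentation uses its own header format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical framing: a 4-byte big-endian length header, then the body.
const std::size_t kHeaderSize = 4;

// Sender side: prepend the body length to the body.
std::string encode_frame(const std::string& body) {
    const std::uint32_t n = static_cast<std::uint32_t>(body.size());
    std::string frame;
    frame.push_back(static_cast<char>((n >> 24) & 0xFF));
    frame.push_back(static_cast<char>((n >> 16) & 0xFF));
    frame.push_back(static_cast<char>((n >> 8) & 0xFF));
    frame.push_back(static_cast<char>(n & 0xFF));
    frame += body;
    return frame;
}

// Receiver side: read the body length from the first kHeaderSize bytes.
std::uint32_t decode_header(const std::string& header) {
    assert(header.size() >= kHeaderSize);
    std::uint32_t n = 0;
    for (std::size_t i = 0; i < kHeaderSize; ++i)
        n = (n << 8) | static_cast<unsigned char>(header[i]);
    return n;
}

// If `received` holds a complete frame, copies its body into `body`, removes
// the frame from `received`, and returns true. Otherwise changes nothing.
bool try_extract(std::string& received, std::string& body) {
    if (received.size() < kHeaderSize) return false;
    const std::uint32_t n = decode_header(received);
    if (received.size() < kHeaderSize + n) return false;
    body.assign(received, kHeaderSize, n);
    received.erase(0, kHeaderSize + n);
    return true;
}
```

With Asio, the same idea would typically become two chained reads: one async_read for exactly kHeaderSize bytes, then, in its handler, a resize of the body buffer to the decoded length followed by a second async_read for that many bytes.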
As for the need to use shared_from_this() in async_read and async_write, the reason is that it guarantees that the method wrapped by boost::bind will always refer to a live object. Consider the following situation:
1. Your handle_accept method calls async_read and sends a handler "into the reactor" - basically you've asked the io_service to invoke Connection::handle_user_read when it finishes reading data from the socket. The io_service stores this functor and continues its loop, waiting for the asynchronous read operation to complete.
2. After your call to async_read, the Connection object is deallocated for some reason (program termination, an error condition, etc.)
3. Suppose the io_service now determines that the asynchronous read is complete, after the Connection object has been deallocated but before the io_service is destroyed (this can occur, for example, if io_service::run is running in a separate thread, as is typical). Now, the io_service attempts to invoke the handler, and it has an invalid reference to a Connection object.
The solution is to allocate Connection via a shared_ptr and use shared_from_this() instead of this when sending a handler "into the reactor" - this allows io_service to store a shared reference to the object, and shared_ptr guarantees that it won't be deallocated until the last reference expires.
So, your code should probably look something like:
class Connection : public boost::enable_shared_from_this<Connection>
Connection(tcp::acceptor &acceptor) :
socket_(acceptor.get_io_service(), tcp::v4())
{ }
void start()
boost::bind(&Connection::start_accept, shared_from_this()));
void start_accept()
boost::bind(&Connection::handle_accept, shared_from_this(),
void handle_accept(const boost::system::error_code& err)
if (err)
async_read(socket_, boost::asio::buffer(user_),
boost::bind(&Connection::handle_user_read, shared_from_this(),
placeholders::error, placeholders::bytes_transferred));
Note that you now must make sure that each Connection object is allocated via a shared_ptr, e.g.:
boost::shared_ptr<Connection> new_conn(new Connection(...));
Hope this helps!
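The lifetime argument can be demonstrated without Boost. In the sketch below, FakeReactor is a hypothetical stand-in for io_service that merely stores handlers and runs them later, and std::enable_shared_from_this plays the role of boost::enable_shared_from_this; the destroyed flag and the reads_handled counter exist only to make the behaviour observable.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for io_service: stores handlers and runs them later.
struct FakeReactor {
    std::vector<std::function<void()>> pending;

    void post(std::function<void()> handler) { pending.push_back(std::move(handler)); }

    // Runs every stored handler once; the handlers are destroyed on return.
    void run() {
        std::vector<std::function<void()>> batch;
        batch.swap(pending);
        for (auto& handler : batch) handler();
    }
};

class Connection : public std::enable_shared_from_this<Connection> {
public:
    explicit Connection(bool* destroyed) : destroyed_(destroyed) {}
    ~Connection() { *destroyed_ = true; }

    // The handler captures a shared_ptr to this object, so the object stays
    // alive until the handler has run, even if the caller drops its pointer.
    void start(FakeReactor& reactor, int* reads_handled) {
        std::shared_ptr<Connection> self = shared_from_this();
        reactor.post([self, reads_handled] { self->handle_read(reads_handled); });
    }

private:
    void handle_read(int* reads_handled) { ++*reads_handled; }

    bool* destroyed_;
};
```

If start() captured the raw this pointer instead, dropping the last shared_ptr before run() would leave the stored handler pointing at a destroyed object, which is exactly the situation described in the three steps above.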
This isn't intended to be an answer per se, but just a lengthy comment: a very simple way to convert from an ASIO buffer to a string is to stream from it:
asio::streambuf buff;
asio::read_until(source, buff, '\r'); // for example
istream is(&buff);
is >> targetstring;
This is a data copy, of course, but that's what you need to do if you want it in a string.
You can use a std::string with async_read() like this:
async_read(socket_, boost::asio::buffer(&user_[0], user_.size()),
boost::bind(&Connection::handle_user_read, this,
placeholders::error, placeholders::bytes_transferred));
However, you'd better make sure that the std::string is big enough to accept the packet that you're expecting and padded with zeros before calling async_read().
And as for why you should NEVER bind a member function callback to a this pointer if the object can be deleted, a more complete description and a more robust method can be found here: Boost async_* functions and shared_ptr's.
Boost Asio has two styles of buffers. There's boost::asio::buffer(your_data_structure), which cannot grow, and is therefore generally useless for unknown input, and there's boost::asio::streambuf which can grow.
Given a boost::asio::streambuf buf, you turn it into a string with std::string(std::istreambuf_iterator<char>(&buf), {});.
This is not efficient as you end up copying data once more, but that would require making boost::asio::buffer aware of growable containers, i.e. containers that have a .resize(N) method. You can't make it efficient without touching Boost code.
