Let me put my question as simply as I can. Mine is network router software built in Erlang, but in one particular scenario I am observing very high memory growth as reported by the VM.
I have one process which receives binary packets from another process that reads them from a socket.
This process parses the binary packet and passes it to a gen_server (handle_cast is called).
The gen_server stores some information in an ETS table and sends the packet on to the peer server.
When the peer server responds, the entry is deleted from the ETS table and the gen_server responds to the first process.
If the first process (the one that sent the packet to the gen_server) times out after 5 seconds waiting for the gen_server's response, it also deletes the ETS entry and exits.
Now, I am observing high memory growth when lots of events time out (due to unavailability of the peer server), and from what I have researched it is the **binary** and **processes_used** figures reported by erlang:memory that account for most of the memory.
The same is not true when events are processed successfully.
The memory can basically be lost in only three places:
The state of your gen_server:
look at your state and find out whether there is something big or growing in there.
Your processes' mailboxes:
see to it that there is some way to always drain unmatched messages: the handle_info callback for a gen_server, or a catch-all Any -> clause in normal receives.
If the mailbox only fills up temporarily, it's probably because the receiving process is too slow for the rate at which messages are produced. This is a common problem with asynchronous communication. If it's only temporary bursts that don't break anything, this could be intended. In that case you can either optimize the receiving process or fix your protocol to use fewer messages.
If you have multiple functions that receive messages, make sure all receiving parts are called regularly. Don't forget the Any -> clauses.
Be aware that while you are processing in a gen_server callback, no messages will be received, so if a callback needs more time than it should, asynchronous messages might pile up (e.g. random message arrival plus a fixed processing time builds an unboundedly growing queue; for details, see queueing theory).
Your ETS table:
maybe the information in the ETS table is not being completely removed? Did you forget to remove something in certain cases?
Trigger GC by hand and see what happens with the memory:
[garbage_collect(Pid) || Pid <- processes()]
Most likely you're leaving processes running that hold references to binaries. If a process dies, all memory related to that process is cleaned up (including any binaries that belonged only to that process).
If you still have leaking binaries, it means some long-running process (a server, a singleton, etc.) keeps references to binaries, either in its process state or through non-tail-recursive functions. Make sure you clean up your state once process communication times out or the peer dies. Also, check that you don't leave references to binaries on the heap by using non-tail-recursive calls.
Related
I have a (POSIX) server that acts as a proxy between many clients and another upstream server. Messages typically flow down from the upstream server, are matched against client interests, and are pushed out to the subset of clients interested in that traffic (maintaining the FIFO order from the upstream server). Currently this proxy server is single-threaded, using an event loop (select, epoll, etc.), but now I'd like to make it multithreaded so that the proxy can more fully utilize an entire machine and achieve much higher throughput.
My high-level design is to have a pool of N worker pthreads (where N is some small multiple of the number of cores on the machine), each of which runs its own event loop. Each client connection is assigned to a specific worker thread, which is then responsible for servicing all of that client's I/O and timeout needs for the duration of the connection. I also intend to have a single dedicated thread that pulls messages in from the upstream server. Once a message is read in, its contents can be considered constant/unchanging until it is no longer needed and is reclaimed. The workers never alter the message contents -- they just pass them along to their clients as needed.
My first question is: should the matching of client interests preferably be done by the producer thread or the worker threads?
In the former approach, for each worker thread, the producer could check the interests (e.g. - group membership) of the worker's clients. If the message matched any clients, then it could push the message onto a dedicated queue for that worker. This approach requires some kind of synchronization between the producer and each worker about their client's rarely changing interests.
In the latter approach, the producer just pushes every message onto some kind of queue shared by all of the worker threads. Then each worker thread checks ALL of the messages for a match against their clients' interests and processes each message that matches. This is a twist on the usual SPMC problem where a consumer is usually assumed to unilaterally take an element for themselves, rather than all consumers needing to do some processing on every element. This approach distributes the matching work across multiple threads, which seems desirable, but I worry it may cause more contention between the threads depending on how we implement their synchronization.
In both approaches, when a message is no longer needed by any worker thread, it then needs to be reclaimed. So, some tracking needs to be done to know when no worker thread needs a message any longer.
My second question is: what is a good way of tracking whether a message is still needed by any of the worker threads?
A simple way to do this would be to assign to each message a count of how many worker threads still need to process the message when it is first produced. Then, when each worker is done processing a message it would decrement the count in a thread-safe manner and if/when the count went to zero we would know it could be reclaimed.
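For illustration, a minimal sketch of this counter approach using C11 atomics; the message layout and payload size are made up, and the queue push itself is elided:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical message wrapper; 'payload' stands in for the real contents. */
typedef struct {
    atomic_int pending;     /* how many workers still need this message */
    char       payload[1400];
} message_t;

/* Producer: set the count before publishing to the workers. */
void publish(message_t *msg, int matching_workers) {
    atomic_store(&msg->pending, matching_workers);
    /* ...push msg onto the worker queue(s)... */
}

/* Worker: call after finishing with a message; the last one frees it.
   (Assumes the message was heap-allocated by the producer.) */
void release(message_t *msg) {
    if (atomic_fetch_sub(&msg->pending, 1) == 1)
        free(msg);
}
```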
Another way to do this would be to assign 64-bit sequence numbers to the messages as they come in; each thread could then track and record the highest sequence number through which it has fully processed. We could then reclaim all messages with sequence numbers less than or equal to the minimum fully-processed sequence number across all of the worker threads.
The latter approach seems like it could more easily allow for a lazy reclamation process with less cross-thread synchronization. That is, you could have a "clean-up" thread that runs only periodically, computing the minimum across the worker threads with much less inter-thread synchronization being necessary. For example, if we assume that reads and writes of a 64-bit integer are atomic and that a worker's fully-processed sequence number is monotonically increasing, then the clean-up thread can just periodically read the workers' fully-processed counts (maybe with a memory barrier) and compute the minimum.
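A sketch of that sequence-number scheme, with C11 atomics standing in for the bare 64-bit reads and writes described above (the explicit release/acquire pair makes the memory-barrier assumption concrete); NWORKERS and the function names are illustrative:

```c
#include <stdatomic.h>
#include <stdint.h>

#define NWORKERS 8   /* illustrative pool size */

/* One monotonically increasing "fully processed" counter per worker. */
static _Atomic uint64_t done_seq[NWORKERS];

/* Worker i, after finishing message 'seq' (messages handled in order): */
void worker_done(int i, uint64_t seq) {
    atomic_store_explicit(&done_seq[i], seq, memory_order_release);
}

/* Periodic clean-up thread: everything <= the minimum can be reclaimed. */
uint64_t reclaimable_up_to(void) {
    uint64_t min = UINT64_MAX;
    for (int i = 0; i < NWORKERS; i++) {
        uint64_t s = atomic_load_explicit(&done_seq[i], memory_order_acquire);
        if (s < min)
            min = s;
    }
    return min;
}
```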
Third question: what is the best way for workers to realize that they have new work to do in their queue(s)?
Each worker thread is going to be managing its own event loop of client file descriptors and timeouts. Is it best for each worker thread to have its own pipe, to which signal data can be written by the producer to poke it into action? Or should workers just periodically check their queue(s) for new work? Are there better ways to do this?
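One common answer is the self-pipe pattern the question alludes to: each worker adds the read end of a pipe to its epoll set alongside the client sockets, and the producer writes a byte to wake it. A rough sketch, assuming Linux epoll (eventfd(2) is a lighter-weight Linux alternative); error handling is omitted:

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/epoll.h>

static int pipefd[2];   /* [0] = read end (worker), [1] = write end (producer) */

/* Worker side: register the pipe's read end in the epoll set. */
void register_wakeup(int epfd) {
    pipe(pipefd);
    fcntl(pipefd[0], F_SETFL, O_NONBLOCK);   /* so draining never blocks */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);
}

/* Producer side: a single byte is enough to poke the worker. */
void wake_worker(void) {
    char b = 0;
    write(pipefd[1], &b, 1);
}

/* Worker, when epoll reports pipefd[0] readable: drain the pipe,
   then go check the work queue(s). */
void on_wakeup(void) {
    char buf[64];
    while (read(pipefd[0], buf, sizeof buf) > 0)
        ;
}
```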
Last question: what kind of data structure and synchronization should I use for the queue(s) between the producer and the consumer?
I'm aware of lock-free data structures but I don't have a good feel for whether they'd be preferable in my situation or if I should instead just go with a simple mutex for operations that affect the queue. Also, in the shared queue approach, I'm not entirely sure how a worker thread should track "where" it is in processing the queue.
Any insights would be greatly appreciated! Thanks!
Based on your problem description, the matching of client interests needs to be done for each client for each message anyway, so the matching work is the same whichever type of thread it occurs in. That suggests the matching should be done in the client threads, to improve concurrency. Synchronization overhead should not be a major issue as long as the "producer" thread ensures the messages are flushed to main memory (technically, "synchronizes memory with respect to other threads") before making their availability known to the other threads; the client threads can then all read the information from main memory simultaneously without synchronizing with each other. The client threads will not be able to modify messages, but they should not need to.
Message reclamation is probably better done by tracking the current message number of each thread rather than by having a message specific counter, as a message specific counter presents a concurrency bottleneck.
I don't think you need formal queueing mechanisms. The "producer" thread can simply keep a volatile variable updated which contains the number of the most recent message that has been flushed to main memory, and the client threads can check the variable when they are free to do work, sleeping if no work is available. You could get more sophisticated on the thread management, but the additional efficiency improvement would likely be minor.
I don't think you need sophisticated data structures for this. You need volatile variables for the number of the latest message that is available for processing and for the number of the most recent message that has been processed by each client thread. You need to flush the messages themselves to main memory. You need some way of finding the messages in main memory from the message number, perhaps a circular buffer of pointers, or of the messages themselves if they are all the same length. You don't really need much else with respect to the data to be communicated between the threads.
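A minimal sketch of that circular-buffer idea, with a C11 release/acquire pair standing in for the "volatile variable plus flush to main memory" wording above; RING_SIZE and the opaque message type are assumptions:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4096            /* assumption: exceeds max in-flight messages */

typedef struct message message_t; /* opaque; filled in by the producer */

static message_t *ring[RING_SIZE];
static _Atomic uint64_t latest;   /* number of the latest published message */

/* Producer: store the message, then publish its number. The release
   store plays the role of the "flush to main memory" described above. */
void produce(uint64_t seq, message_t *msg) {
    ring[seq % RING_SIZE] = msg;
    atomic_store_explicit(&latest, seq, memory_order_release);
}

/* Client thread: find a message by number once it has been published. */
message_t *lookup(uint64_t seq) {
    if (seq > atomic_load_explicit(&latest, memory_order_acquire))
        return NULL;              /* not yet available for processing */
    return ring[seq % RING_SIZE];
}
```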
We have a quite large multitasking communication system implemented on VxWorks 5.5 and the PPC8260. The system has to handle a lot of Ethernet traffic and also some cyclic peripheral control activities via RS-232, memory-mapped I/O, etc. What happens is that at some moment a few of the message queues we use for inter-task communication become overflowed (I see it by log inspection). When I check the status of the tasks responsible for serving these message queues (that is, doing receives on them), they appear to be READY. When I inspect msgQShow for the queues themselves, they are full, yet no tasks appear to be blocked on them. But looking at a task's stack trace shows the task actually pending inside a msgQReceive call, specifically in the qJobGet kernel call or something similar.
It is unlikely in the extreme that a message queue "lost" a task that was blocked on it.
From your description we can assume:
A message queue has overflowed. Presumably you have detected this by checking the return value from msgQSend, which has been invoked either with a timeout value or with NO_WAIT.
msgQShow confirms the queue is full.
Tasks that should be reading the queue are in the READY state.
READY is the state tasks are in when they are available to run. Tasks are held in a queue (strictly, one queue per priority level), and when they reach the head of the queue they will be scheduled.
If tasks are persistently showing as READY, that suggests they are not getting CPU time. The fact that the message queue does not appear to empty supports that.
You should use tools such as System Viewer to diagnose this. You may need to raise the priority of the reader tasks, and if your msgQSend is using NO_WAIT, you may need to use a timeout value instead.
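For instance, a sketch of the sending side, assuming the standard VxWorks msgQLib interface; the one-second timeout (one clock-rate's worth of ticks) and the logging are illustrative choices:

```c
#include <vxWorks.h>
#include <msgQLib.h>
#include <sysLib.h>
#include <logLib.h>

/* Instead of failing immediately with NO_WAIT, block for up to one second
   waiting for the reader to drain the queue, and log if it never does. */
STATUS sendWithTimeout(MSG_Q_ID qid, char *buf, UINT len)
{
    if (msgQSend(qid, buf, len, sysClkRateGet(), MSG_PRI_NORMAL) == ERROR)
    {
        logMsg("msgQSend timed out, queue still full\n", 0, 0, 0, 0, 0, 0);
        return ERROR;
    }
    return OK;
}
```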
I am trying to write a test program that writes data across a TCP socket over localhost on a Linux machine (CentOS 6.5, to be exact). I have one program writing and one program reading. The reading is done via non-blocking recv() calls triggered from an epoll. There are enough cores on the machine to handle the CPU load without contention or scheduling issues.
The sends are buffers of smaller packets (about 100 bytes) aggregated up to 1400 bytes; changing to aggregate larger (64K) buffers makes no apparent difference. When I do the sends, after tens of MB of data, I start getting EAGAIN errors on the sender. Verifying via fcntl, I am definitely configured as a blocking socket. I also noticed that calling ioctl(SIOCOUTQ) whenever I get EAGAIN yields larger and larger numbers.
The receiver is slower in processing the data read than the sender is in creating the data. Adding receiver threads is not an option. The fact that it is slower is OK assuming I can throttle the sender.
Now, my understanding of blocking sockets is that a send() should block in the send until the data goes out (this is based upon past experience) - meaning, the tcp stack should force the input side to be self-throttling.
Also, the Linux manual page for send() (and other queries on the web) indicate that EAGAIN is only returned for non-blocking sockets.
Can anyone give me some understanding of what is going on, and how I can get my send() to actually block until enough data goes out for me to put more in? I would rather not rework the logic to add usleep() or similar to the code. It is a lightly loaded system, and yield() is insufficient to allow it to drain.
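One thing worth double-checking in code: EAGAIN on a nominally blocking socket usually means either that O_NONBLOCK is set after all, or that a send timeout is in effect, since per socket(7) a blocking send() also returns EAGAIN when SO_SNDTIMEO expires. A small diagnostic sketch using only standard calls, where sock stands for the sender's descriptor:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Diagnostic: confirm whether 'sock' is really blocking, and whether a
   send timeout is set (SO_SNDTIMEO also makes a blocking send() return
   EAGAIN when the timeout expires). */
void check_socket(int sock)
{
    int flags = fcntl(sock, F_GETFL, 0);
    printf("O_NONBLOCK is %s\n", (flags & O_NONBLOCK) ? "set" : "clear");

    struct timeval tv;
    socklen_t len = sizeof tv;
    if (getsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, &tv, &len) == 0)
        printf("SO_SNDTIMEO = %ld.%06ld s (zero means no timeout)\n",
               (long)tv.tv_sec, (long)tv.tv_usec);
}
```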
As a side project I'm currently writing a server for an age-old game I used to play. I'm trying to make the server as loosely coupled as possible, but I am wondering what would be a good design decision for multithreading. Currently I have the following sequence of actions:
Startup (creates) ->
Server (listens for clients, creates) ->
Client (listens for commands and sends period data)
I'm assuming an average of 100 clients, as that was the max at any given time for the game. What would be the right decision as for threading of the whole thing? My current setup is as follows:
1 thread on the server which listens for new connections, on new connection create a client object and start listening again.
Client object has one thread, listening for incoming commands and sending periodic data. This is done using a non-blocking socket, so it simply checks if there's data available, deals with that and then sends messages it has queued. Login is done before the send-receive cycle is started.
One thread (for now) for the game itself, as I consider that to be separate from the whole client-server part, architecturally speaking.
This would result in a total of 102 threads. I am even considering giving the client 2 threads, one for sending and one for receiving. If I do that, I can use blocking I/O on the receiver thread, which means that thread will be mostly idle in an average situation.
My main concern is that by using this many threads I'll be hogging resources. I'm not worried about race conditions or deadlocks, as that's something I'll have to deal with anyway.
My design is set up in such a way that I could use a single thread for all client communications, whether there is 1 client or 100. I've separated the communications logic from the client object itself, so I could implement it without having to rewrite a lot of code.
The main question is: is it wrong to use over 200 threads in an application? Does it have advantages? I'm thinking about running this on a multi-core machine, would it take a lot of advantage of multiple cores like this?
Thanks!
Most of these threads will usually be blocked. I don't expect connections to exceed 5 per minute, and commands from the client will come in infrequently, I'd say 20 per minute on average.
Going by the answers I get here (the context switching was the performance hit I was thinking about, but I didn't know that until you pointed it out, thanks!) I think I'll go for the approach with one listener, one receiver, one sender, and some miscellaneous stuff ;-)
Use an event stream/queue and a thread pool to maintain the balance; this will adapt better to other machines, which may have more or fewer cores.
In general, many more active threads than you have cores will waste time context-switching.
If your game consists of a lot of short actions, a circular/recycling event queue will give better performance than a fixed number of threads.
To answer the question simply, it is entirely wrong to use 200 threads on today's hardware.
Each thread takes up 1 MB of memory, so you're taking up 200MB of page file before you even start doing anything useful.
By all means break your operations up into little pieces that can be safely run on any thread, but put those operations on queues and have a fixed, limited number of worker threads servicing those queues.
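As a sketch of what "queues plus a fixed, limited number of workers" can look like with pthreads; the task type and LIFO list are simplifications for brevity:

```c
#include <pthread.h>
#include <stddef.h>

/* A minimal fixed pool: N threads service one locked queue. */
typedef struct task { struct task *next; void (*run)(void *); void *arg; } task_t;

static task_t         *head;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

void submit(task_t *t) {
    pthread_mutex_lock(&lock);
    t->next = head;                  /* LIFO for brevity */
    head = t;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

void *worker(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == NULL)
            pthread_cond_wait(&cond, &lock);
        task_t *t = head;
        head = t->next;
        pthread_mutex_unlock(&lock);
        t->run(t->arg);              /* the little piece of work */
    }
    return NULL;
}

/* Startup: a small, fixed number of workers, e.g. one per core:
   for (i = 0; i < ncores; i++) pthread_create(&tid[i], NULL, worker, NULL); */
```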
Update: Does wasting 200MB matter? On a 32-bit machine, it's 10% of the entire theoretical address space for a process - no further questions. On a 64-bit machine, it sounds like a drop in the ocean of what could be theoretically available, but in practice it's still a very big chunk (or rather, a large number of pretty big chunks) of storage being pointlessly reserved by the application, and which then has to be managed by the OS. It has the effect of surrounding each client's valuable information with lots of worthless padding, which destroys locality, defeating the OS and CPU's attempts to keep frequently accessed stuff in the fastest layers of cache.
In any case, the memory wastage is just one part of the insanity. Unless you have 200 cores (and an OS capable of utilizing them), you don't really have 200 parallel threads. You have (say) 8 cores, each frantically switching between 25 threads. Naively you might think that each thread therefore experiences the equivalent of running on a core that is 25 times slower. But it's actually much worse than that: the OS spends more time taking one thread off a core and putting another one on it ("context switching") than it does actually allowing your code to run.
Just look at how any well-known successful design tackles this kind of problem. The CLR's thread pool (even if you're not using it) serves as a fine example. It starts off assuming just one thread per core will be sufficient. It allows more to be created, but only to ensure that badly designed parallel algorithms will eventually complete. It refuses to create more than 2 threads per second, so it effectively punishes thread-greedy algorithms by slowing them down.
I write in .NET and I'm not sure if the way I code is due to .NET limitations and their API design or if this is a standard way of doing things, but this is how I've done this kind of thing in the past:
A queue object that will be used for processing incoming data. This should be lock-synchronized between the queuing thread and the worker thread to avoid race conditions.
A worker thread for processing data in the queue. The thread that queues up the data uses a semaphore to notify this thread to process items in the queue. This thread starts before any of the other threads and contains a continuous loop that runs until it receives a shutdown request. The first instruction in the loop checks a pause/continue/terminate flag. The flag is initially set to pause, so that the thread sits in an idle state (instead of looping continuously) while there is no processing to be done. The queuing thread changes the flag when there are items in the queue to be processed. The worker then processes a single item from the queue on each iteration of the loop. When the queue is empty it sets the flag back to pause, so that on the next iteration it waits until the queuing process notifies it that there is more work to be done. (A rough sketch of this queue/worker pairing, in C, appears after this list.)
One connection listener thread which listens for incoming connection requests and passes these off to...
A connection processing thread that creates the connection/session. Having this separate from your connection listener thread reduces the potential for missed connection requests caused by the listener being busy processing a request.
An incoming data listener thread that listens for incoming data on the current connection. All data is passed off to a queuing thread to be queued up for processing. Your listener threads should do as little as possible outside of basic listening and passing the data off for processing.
A queuing thread that queues up the data in the right order so everything can be processed correctly. This thread posts the semaphore to the processing thread to let it know there's data to be processed. Having this thread separate from the incoming data listener means that you're less likely to miss incoming data.
Some session object which is passed between methods so that each user's session is self contained throughout the threading model.
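A rough C analogue of the queue/worker pairing described above, with a POSIX semaphore playing the role of the pause/continue flag; dequeue and process are hypothetical stand-ins for the lock-protected queue operations:

```c
#include <semaphore.h>

extern void *dequeue(void *queue);  /* hypothetical: pops one item, lock-protected */
extern void  process(void *item);   /* hypothetical: handles a single item */

static sem_t items;                 /* initialize once: sem_init(&items, 0, 0); */

/* Worker: sleeps on the semaphore while idle instead of looping
   continuously, and handles exactly one item per wakeup. */
void *processing_thread(void *queue) {
    for (;;) {
        sem_wait(&items);           /* the "pause" state */
        process(dequeue(queue));
    }
}

/* Queuing thread, after enqueueing each item: */
void notify_item_queued(void) {
    sem_post(&items);               /* the "continue" notification */
}
```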
This keeps the thread count down to as simple yet robust a model as I've figured out. I would love to find a simpler model, but I've found that if I try to reduce the threading model any further, I start missing data on the network stream or missing connection requests.
It also assists with TDD (Test-Driven Development), in that each thread processes a single task and is much easier to write tests for. Having hundreds of threads can quickly become a resource-allocation nightmare, while having a single thread becomes a maintenance nightmare.
It's far simpler to keep one thread per logical task, the same way you would have one method per task in a TDD environment, and you can logically separate what each should be doing. It's easier to spot potential problems and far easier to fix them.
What's your platform? If Windows then I'd suggest looking at async operations and thread pools (or I/O Completion Ports directly if you're working at the Win32 API level in C/C++).
The idea is that you have a small number of threads that deal with your I/O and this makes your system capable of scaling to large numbers of concurrent connections because there's no relationship between the number of connections and the number of threads used by the process that is serving them. As expected, .Net insulates you from the details and Win32 doesn't.
The challenge of using async I/O and this style of server is that the processing of client requests becomes a state machine on the server, where arriving data triggers changes of state. This takes some getting used to, but once you do, it's really rather marvellous ;)
I've got some free code that demonstrates various server designs in C++ using IOCP here.
If you're using unix or need to be cross platform and you're in C++ then you might want to look at boost ASIO which provides async I/O functionality.
I think the question you should be asking is not whether 200 threads in general is good or bad, but rather how many of those threads are going to be active.
If only several of them are active at any given moment, while all the others are sleeping or waiting or whatnot, then you're fine. Sleeping threads, in this context, cost you nothing.
However, if all of those 200 threads are active, your CPU will waste a great deal of time on context switches between them.
I've got a RabbitMQ queue that might, at times, hold a considerable amount of data to process.
As far as I understand, using channel.consume will try to force the messages into the Node program even as it approaches its RAM limit (and it will, eventually, crash).
What is the best way to ensure workers get only as many tasks to process as they are capable of handling?
I'm thinking about using a chain of (transform) streams together with channel.get (which gets just one message). If the first stream's buffer is full, we simply stop getting messages.
I believe what you want is to specify the consumer prefetch.
This indicates to RabbitMQ how many messages it should "push" to the consumer at once.
An example is provided here
channel.prefetch(1);
would be the lowest value to provide, and should ensure the least memory consumption for your Node program.
This is based on your description; if my understanding is correct, I'd also recommend renaming your question ("parallel processing" relates more to multiple consumers on a single queue, not to a single consumer getting all the messages).