We executed the script with 4k, 5k, and 6k users from 2 slaves (2k, 2.5k, and 3k from each) and 1 master (receiving the responses).
But the execution gets stuck and does not complete for the last 20-30 users, and the active thread count shows a negative value.
So when is a negative value displayed for the active threads?
Why is the execution getting stuck and not completing?
Also, we sent 5k requests, but when we check the HTML report, for some endpoints the number of requests does not match the number we sent.
The only way to determine why the execution is "stuck" is to take a thread dump and see what exactly the threads are doing.
Among the possible reasons are:
Your application fails to respond. By default JMeter will wait for the response forever, so it makes sense to define reasonable timeouts in HTTP Request Defaults or an equivalent configuration element.
You're violating JMeter Best Practices somehow. 3000 threads is quite a high number for a single load generator, so you need to tune JMeter properly to conduct such a load.
JMeter doesn't have enough headroom to operate in terms of CPU, RAM, network or disk. Make sure it has sufficient resources, as a shortage may cause false negative test results or who knows what else (e.g. a negative number of active threads). If you don't have a better monitoring toolchain in place, you can go for the JMeter PerfMon Plugin.
I know Rails 5 ships with Puma (which we're using) and will look for RAILS_MAX_THREADS as an environment variable or default to 5 threads, but I'm receiving timeout errors with the default value. I looked at my database and found its max connections is a few thousand.
It may be silly, but is this something Puma will set automatically and scale for, depending on its settings, or do I need to explicitly set this in the environment variables? If it needs to be manually set, what would be a good value for RAILS_MAX_THREADS?
I've found the following helpful, but I'm not fully grasping the scalability part:
https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server
https://devcenter.heroku.com/articles/concurrency-and-database-connections
Puma actually has two parameters: the number of threads and the number of workers. If we slightly change the default puma.rb, it will look like this:
# WORKERS_NUM is not a default env variable name
workers Integer(ENV['WORKERS_NUM'] || 1)
max_threads_count = Integer(ENV['RAILS_MAX_THREADS'] || 1)
min_threads_count = max_threads_count
threads min_threads_count, max_threads_count
The number of workers is the number of separate processes that Puma spawns for you. Usually, it is a good idea to set it equal to the number of processor cores you have on your server. You could spawn more of them to allow for more requests to be processed simultaneously, but workers create additional memory overhead – each worker spins up a copy of your rails app, so usually, you would use threads to achieve higher throughput.
RAILS_MAX_THREADS is a way to set the number of threads each of your workers will use under the hood. In the example above, the min_threads_count is equal to the max_threads_count, so the number of threads is constant. If you set them to be different, it is going to scale from the min to the max, but I haven't seen it in the wild.
There are several reasons to limit the number of threads, such as your interpreter, platform limits, and thread safety:
If you use MRI, your threads are limited by the GIL, so they're not run in parallel. MRI imitates parallel execution by context switching. A large number of threads will allow for many more simultaneous connections, but the average response time will increase because of the GIL.
Platform limits: e.g. Heroku has thread number limits (https://devcenter.heroku.com/articles/dynos#process-thread-limits), while Linux limits only the number of processes (see "Maximum number of threads per process in Linux?").
When the code isn't thread-safe, there is a chance that using more than one thread will result in unpredictable problems. That's actually my case, so I didn't experiment much with the number of threads.
There was also an argument that slow IO (e.g. calls to external services, or generating large files on the fly) blocks the Ruby process and doesn't allow context switching, but it turns out not to be true (http://yehudakatz.com/2010/08/14/threads-in-ruby-enough-already/). Still, optimizing your architecture to do as much work in the background as possible is always a good idea.
This answer will help you to find out a perfect combination of the number of threads vs the number of workers given your hardware.
This shows how the benchmarking could be done to find the exact numbers.
To sum up:
WORKERS_NUM multiplied by RAILS_MAX_THREADS gives you the maximum number of simultaneous connections that can be processed by Puma. If the number is too low, your users will see timeouts during load spikes. To achieve the best performance when you use MRI, set WORKERS_NUM to the number of cores and find the optimal RAILS_MAX_THREADS based on the average response time during performance tests.
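As a rough illustration of that multiplication, here is a minimal Python sketch. The function names are made up for this example, and the database figure rests on the assumption that every thread may hold one database connection at a time, so treat it as a starting estimate rather than a rule.

```python
def max_concurrent_requests(workers: int, threads_per_worker: int) -> int:
    """Upper bound on requests that can be in flight at once."""
    return workers * threads_per_worker


def db_connections_needed(workers: int, threads_per_worker: int) -> int:
    # Assumption: each thread may check out one database connection,
    # so the worst case equals the number of threads across all workers.
    return workers * threads_per_worker


# Example: WORKERS_NUM=4 (one per core) and RAILS_MAX_THREADS=5
capacity = max_concurrent_requests(4, 5)
connections = db_connections_needed(4, 5)
```

With these example values the application can process 20 requests at once and may open up to 20 database connections, which is far below a database limit of a few thousand.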
Using LMAX Disruptor, we have observed that if we use 5-10 disruptors together in an application (a chain of disruptors, each with one consumer that performs a specified task and then hands the message over to the next disruptor/ring buffer), CPU utilization reaches 90% and above and the system becomes unresponsive until we bring down the application. We feel it's because of so many active disruptor threads. This happens even when the disruptors are not really processing anything. Can anyone comment on what the optimum number of disruptors in an application should be?
It could be that you need to change the wait strategies you are using on the consumers. If you are using the busy-wait strategy on all of them, even if no inputs have been provided to the ring buffers, the polling threads could still tie up CPU resources because they'll be in tight loops where they're constantly checking the buffer for new values to read.
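The difference between a busy-waiting consumer and a blocking one can be sketched in plain Python. This is only an analogy, not Disruptor code: the function names and the loop counters are invented for the illustration. The spinning consumer keeps iterating while its queue is empty, whereas the blocking consumer sleeps until something arrives.

```python
import queue
import threading
import time


def busy_spin_consumer(q, stop, stats):
    """Polls in a tight loop; keeps a core busy even when the queue is empty."""
    while not stop.is_set():
        stats["loops"] += 1
        try:
            q.get_nowait()
            stats["items"] += 1
        except queue.Empty:
            pass  # nothing to do, but we immediately check again


def blocking_consumer(q, stats):
    """Sleeps inside q.get() until an item arrives; None is a shutdown sentinel."""
    while True:
        item = q.get()
        stats["loops"] += 1
        if item is None:
            break
        stats["items"] += 1


def run_idle_comparison(idle_seconds=0.05):
    spin_q, block_q = queue.Queue(), queue.Queue()
    spin_stats = {"loops": 0, "items": 0}
    block_stats = {"loops": 0, "items": 0}
    stop = threading.Event()
    spinner = threading.Thread(target=busy_spin_consumer,
                               args=(spin_q, stop, spin_stats))
    blocker = threading.Thread(target=blocking_consumer,
                               args=(block_q, block_stats))
    spinner.start()
    blocker.start()
    time.sleep(idle_seconds)  # no input at all during this window
    stop.set()
    block_q.put(None)         # wake the blocking consumer so it can exit
    spinner.join()
    blocker.join()
    return spin_stats, block_stats
```

During the idle window the spinning consumer goes around its loop many times without receiving anything, while the blocking consumer wakes up exactly once, for the shutdown sentinel. Multiply the spinning behaviour by 5-10 consumer threads and the idle CPU load described in the question follows.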
I'm writing a multi-threaded OpenMPI application, using MPI_Isend and MPI_Irecv from several threads to exchange hundreds of messages per second between ranks over InfiniBand RDMA.
Transfers are in the order of 400 - 800KByte, generating about 9 Gbps in and out for each rank, well within the capacity of FDR. Simple MPI benchmarks also show good performance.
The completion of the transfers is checked upon by polling all active transfers using MPI_Testsome in a dedicated thread.
The transfer rates I achieve depend on the message rate, but more importantly also on the polling frequency of MPI_Testsome. That is, if I poll, say, every 10ms, the requests finish later than if I poll every 1ms.
I'd expect that if I poll every 10ms instead of every 1ms, I'd at most be informed of finished requests 9ms later. I'd not expect fewer calls to MPI_Testsome to delay the completion of the transfers themselves, and thus slow down the total transfer rates. I'd expect MPI_Testsome to be entirely passive.
Anyone here have a clue why this behaviour could occur?
The observed behaviour is due to the way operation progression is implemented in Open MPI. Posting a send or receive, no matter whether it is done synchronously or asynchronously, results in a series of internal operations being queued. Progression is basically the processing of those queued operations. There are two modes that you can select at library build time: one with an asynchronous progression thread and one without, the latter being the default.
When the library is compiled with the async progression thread enabled, a background thread takes care of processing the queue. This allows background transfers to proceed in parallel with the user's code but increases the latency. Without async progression, operations are faster, but progression can only happen when the user code calls into the MPI library, e.g. while in MPI_Wait or MPI_Test and family. The MPI_Test family of functions is implemented in such a way as to return as fast as possible. That means that the library has to balance a trade-off between doing work in the call, thus slowing it down, and returning quickly, which means fewer operations are progressed on each call.
Some of the Open MPI developers, notably Jeff Squyres, visit Stack Overflow every now and then. He could possibly provide more details.
This behaviour is hardly specific to Open MPI. Unless heavily hardware-assisted, MPI is usually implemented following the same methods.
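A toy model in Python may make the effect easier to see. It is not how Open MPI works internally: the assumptions that a transfer needs a fixed number of progress steps and that each test call advances exactly one step are invented for the illustration. It only shows why, without a progression thread, the polling interval stretches the transfer itself rather than just the notification.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """A transfer that needs `steps_left` progress steps to finish (made-up model)."""
    steps_left: int

    def test(self) -> bool:
        # Like a non-blocking test call: advance at most one step, then return.
        if self.steps_left > 0:
            self.steps_left -= 1
        return self.steps_left == 0


def completion_time_ms(steps: int, poll_interval_ms: int) -> int:
    """Virtual time at which a polling loop first sees the request complete."""
    request = ToyRequest(steps)
    now = 0
    while True:
        now += poll_interval_ms  # the polling thread sleeps between calls
        if request.test():
            return now


fast_polling = completion_time_ms(steps=8, poll_interval_ms=1)
slow_polling = completion_time_ms(steps=8, poll_interval_ms=10)
```

In this model a transfer needing 8 steps completes after 8 ms when polled every 1 ms, but only after 80 ms when polled every 10 ms, because no progress is made between calls.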
Ever since I discovered sockets, I've been using the nonblocking variants, since I didn't want to bother with learning about threading. Since then I've gathered a lot more experience with threading, and I'm starting to ask myself.. Why would you ever use it for sockets?
A big premise of threading seems to be that they only make sense if they get to work on their own set of data. Once you have two threads working on the same set of data, you will have situations such as:
if(!hashmap.hasKey("bar"))
{
    doStuff(); // <-- meanwhile another thread inserts "bar" into hashmap
    hashmap["bar"] = "foo"; // <-- our premise that the key didn't exist
                            // (likely to avoid overwriting something) is now invalid
}
Now imagine hashmap to map remote IPs to passwords. You can see where I'm going. I mean, sure, the likelihood of such thread interaction going wrong is pretty small, but it still exists, and to keep one's program secure, you have to account for every eventuality. This will significantly increase the effort going into design, as compared to a simple, single-threaded workflow.
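For reference, the usual way to close this kind of check-then-act gap is to perform the check and the write under one lock. The sketch below is in Python rather than the pseudocode above, and GuardedMap, put_if_absent and race_for_key are names invented for the example.

```python
import threading


class GuardedMap:
    """Dictionary whose check-and-insert runs as one atomic step."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put_if_absent(self, key, value) -> bool:
        # The check and the write happen under the same lock, so no other
        # thread can insert `key` between them.
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def get(self, key):
        with self._lock:
            return self._data.get(key)


def race_for_key(n_threads=8):
    """Let several threads compete to insert the same key."""
    shared = GuardedMap()
    winners = []

    def worker(i):
        if shared.put_if_absent("bar", f"foo-{i}"):
            winners.append(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared, winners
```

However the threads interleave, exactly one of them inserts the key and the others see that it already exists. This is the extra design effort the question is talking about: every shared structure needs this kind of treatment.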
I can completely see how threading is great for working on separate sets of data, or for programs that are explicitly optimized to use threading. But for the "general" case, where the programmer is only concerned with shipping a working and secure program, I can not find any reason to use threading over polling.
But seeing as the "separate thread" approach is extremely widespread, maybe I'm overlooking something. Enlighten me! :)
There are two common reasons for using threads with sockets, one good and one not-so-good:
The good reason: Because your computer has more than one CPU core, and you want to make use of the additional cores. A single-threaded program can only use a single core, so with a heavy workload you'd have one core pinned at 100%, and the other cores sitting unused and going to waste.
The not-so-good reason: You want to use blocking I/O to simplify your program's logic -- in particular, you want to avoid dealing with partial reads and partial writes, and keep each socket's context/state on the stack of the thread it's associated with. But you also want to be able to handle multiple clients at once, without slow client A causing an I/O call to block and hold off the handling of fast client B.
The reason the second reason is not-so-good is that while having one thread per socket seems to simplify the program's design, in practice it usually complicates it. It introduces the possibility of race conditions and deadlocks, and makes it difficult to safely access shared data (as you mentioned). Worse, if you stick with blocking I/O, it becomes very difficult to shut the program down cleanly (or in any other way affect a thread's behavior from anywhere other than the thread's socket), because the thread is typically blocked in an I/O call (possibly indefinitely) with no reliable way to wake it up. (Signals don't work reliably in multithreaded programs, and going back to non-blocking I/O means you lose the simplified program structure you were hoping for.)
In short, I agree with cib -- multithreaded servers can be problematic and therefore should generally be avoided unless you absolutely need to make use of multiple cores -- and even then it might be better to use multiple processes rather than multiple threads, for safety's sake.
The biggest advantage of threads is that they prevent lag from accumulating while requests are processed. When polling, you use a loop to service every socket with a state change. For a handful of clients this is not very noticeable, but it could lead to significant delays when dealing with a significantly large number of clients.
Assume that each transaction requires some pre-processing and post-processing (depending on the protocol this may be a trivial amount of processing, or it could be relatively significant, as is the case with BEEP or SOAP). The combined time to pre-process and post-process requests could lead to a backlog of pending requests.
For illustration purposes, imagine that the pre-processing, processing, and post-processing stages of a request each consume 1 millisecond, so that the total request takes 3 milliseconds to complete. In a single-threaded environment the system would become overwhelmed if incoming requests reach 334 requests per second (since it would take 1.002 seconds to service all requests received within a 1 second period), leading to a time deficit of 0.002 seconds each second. However, if the system were using threads, it would theoretically be possible to require only 0.336 seconds (0.334 for shared data access + 0.001 pre-processing + 0.001 post-processing) of processing time to complete all of the requests received in a 1 second period.
Although it is theoretically possible to process all requests in 0.336 seconds, this would require each request to have its own thread. It is more reasonable to multiply the combined pre/post-processing time per request (0.002 seconds) by the number of requests, which gives 0.668 seconds, and divide by the number of configured threads. For example, using the same 334 incoming requests and processing times, 2 threads would theoretically complete all requests in 0.668 seconds (0.668 / 2 + 0.334), 4 threads in 0.501 seconds, and 8 threads in 0.418 seconds.
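The arithmetic above can be reproduced with a few lines of Python. This sketch takes each stage as 1 ms (the value that the 1.002-second figure implies) and assumes, as the example does, that the middle stage is serialized on shared data while pre- and post-processing split evenly across threads. The names are invented for the illustration.

```python
REQUESTS = 334         # requests arriving within one second (from the example)
STAGE_SECONDS = 0.001  # assumed cost of each of the three stages


def single_threaded_seconds(requests=REQUESTS):
    """All three stages of every request run back to back on one thread."""
    return requests * 3 * STAGE_SECONDS


def threaded_seconds(threads, requests=REQUESTS):
    """Pre/post-processing is split across threads; shared-data access is not."""
    pre_post = requests * 2 * STAGE_SECONDS  # 0.668 s in total
    shared = requests * STAGE_SECONDS        # 0.334 s, serialized
    return pre_post / threads + shared


timings = {n: round(threaded_seconds(n), 4) for n in (2, 4, 8, REQUESTS)}
```

Under these assumptions the single-threaded time is 1.002 s, and timings holds 0.668 s for 2 threads, 0.501 s for 4, 0.4175 s for 8, and 0.336 s for one thread per request. The serialized 0.334 s is a floor that no number of threads removes.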
If the highest request volume your daemon receives is relatively low, then a single-threaded implementation with non-blocking I/O is sufficient. However, if you expect occasional bursts of a high volume of requests, then it is worth considering a multi-threaded model.
I've written more than a handful of UNIX daemons which have relatively low throughput, and I've used a single-threaded design for simplicity. However, when I wrote a custom netflow receiver for an ISP, I used a threaded model for the daemon and it was able to handle peak times of Internet usage with minimal bumps in system load average.
As a side project I'm currently writing a server for an age-old game I used to play. I'm trying to make the server as loosely coupled as possible, but I am wondering what would be a good design decision for multithreading. Currently I have the following sequence of actions:
Startup (creates) ->
Server (listens for clients, creates) ->
Client (listens for commands and sends periodic data)
I'm assuming an average of 100 clients, as that was the max at any given time for the game. What would be the right decision as for threading of the whole thing? My current setup is as follows:
1 thread on the server which listens for new connections, on new connection create a client object and start listening again.
Client object has one thread, listening for incoming commands and sending periodic data. This is done using a non-blocking socket, so it simply checks if there's data available, deals with that and then sends messages it has queued. Login is done before the send-receive cycle is started.
One thread (for now) for the game itself, as I consider that to be separate from the whole client-server part, architecturally speaking.
This would result in a total of 102 threads. I am even considering giving the client 2 threads, one for sending and one for receiving. If I do that, I can use blocking I/O on the receiver thread, which means that thread will be mostly idle in an average situation.
My main concern is that by using this many threads I'll be hogging resources. I'm not worried about race conditions or deadlocks, as that's something I'll have to deal with anyway.
My design is set up in such a way that I could use a single thread for all client communications, no matter if it's 1 or 100. I've separated the communications logic from the client object itself, so I could implement it without having to rewrite a lot of code.
The main question is: is it wrong to use over 200 threads in an application? Does it have advantages? I'm thinking about running this on a multi-core machine, would it take a lot of advantage of multiple cores like this?
Thanks!
Out of all these threads, most of them will be blocked usually. I don't expect connections to be over 5 per minute. Commands from the client will come in infrequently, I'd say 20 per minute on average.
Going by the answers I get here (the context switching was the performance hit I was thinking about, but I didn't know that until you pointed it out, thanks!) I think I'll go for the approach with one listener, one receiver, one sender, and some miscellaneous stuff ;-)
use an event stream/queue and a thread pool to maintain the balance; this will adapt better to other machines which may have more or fewer cores
in general, many more active threads than you have cores will waste time context-switching
if your game consists of a lot of short actions, a circular/recycling event queue will give better performance than a fixed number of threads
To answer the question simply, it is entirely wrong to use 200 threads on today's hardware.
Each thread reserves about 1 MB of stack by default (on Windows), so you're taking up 200MB of page file before you even start doing anything useful.
By all means break your operations up into little pieces that can be safely run on any thread, but put those operations on queues and have a fixed, limited number of worker threads servicing those queues.
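A minimal Python sketch of that idea follows. The handle_command function is a hypothetical stand-in for one small unit of work, and sizing the pool to the core count is an assumption rather than a rule. The point is only that 200 clients' worth of work runs on a handful of threads.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def handle_command(client_id: int, command: str) -> str:
    """Hypothetical small unit of work that is safe to run on any thread."""
    return f"client {client_id}: {command.upper()}"


def process_all(commands, max_workers=None):
    # A fixed pool sized to the core count, instead of one thread per client.
    max_workers = max_workers or os.cpu_count() or 4
    threads_used = set()

    def task(item):
        threads_used.add(threading.get_ident())  # record which thread ran it
        return handle_command(*item)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(task, commands))  # results keep input order
    return results, len(threads_used)


commands = [(client_id, "ping") for client_id in range(200)]
results, thread_count = process_all(commands, max_workers=4)
```

All 200 commands are processed, yet thread_count never exceeds the 4 workers in the pool. The pool's internal queue absorbs bursts, so the thread count stays fixed regardless of the number of clients.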
Update: Does wasting 200MB matter? On a 32-bit machine, it's 10% of the entire theoretical address space for a process - no further questions. On a 64-bit machine, it sounds like a drop in the ocean of what could be theoretically available, but in practice it's still a very big chunk (or rather, a large number of pretty big chunks) of storage being pointlessly reserved by the application, and which then has to be managed by the OS. It has the effect of surrounding each client's valuable information with lots of worthless padding, which destroys locality, defeating the OS and CPU's attempts to keep frequently accessed stuff in the fastest layers of cache.
In any case, the memory wastage is just one part of the insanity. Unless you have 200 cores (and an OS capable of utilizing them), you don't really have 200 parallel threads. You have (say) 8 cores, each frantically switching between 25 threads. Naively you might think that, as a result, each thread experiences the equivalent of running on a core that is 25 times slower. But it's actually much worse than that: the OS spends more time taking one thread off a core and putting another one on it ("context switching") than it does actually allowing your code to run.
Just look at how any well-known successful design tackles this kind of problem. The CLR's thread pool (even if you're not using it) serves as a fine example. It starts off assuming just one thread per core will be sufficient. It allows more to be created, but only to ensure that badly designed parallel algorithms will eventually complete. It refuses to create more than 2 threads per second, so it effectively punishes thread-greedy algorithms by slowing them down.
I write in .NET and I'm not sure if the way I code is due to .NET limitations and their API design or if this is a standard way of doing things, but this is how I've done this kind of thing in the past:
A queue object that will be used for processing incoming data. This should be sync locked between the queuing thread and worker thread to avoid race conditions.
A worker thread for processing data in the queue. The thread that queues up the data uses a semaphore to notify this thread to process items in the queue. This thread will start before any of the other threads and contain a continuous loop that can run until it receives a shutdown request. The first instruction in the loop checks a flag to pause/continue/terminate processing. The flag will initially be set to pause so that the thread sits in an idle state (instead of looping continuously) while there is no processing to be done. The queuing thread will change the flag when there are items in the queue to be processed. This thread will then process a single item in the queue on each iteration of the loop. When the queue is empty it will set the flag back to pause so that on the next iteration of the loop it will wait until the queuing thread notifies it that there is more work to be done.
One connection listener thread which listens for incoming connection requests and passes these off to...
A connection processing thread that creates the connection/session. Having a separate thread from your connection listener thread means that you're reducing the potential for missed connection requests due to reduced resources while that thread is processing requests.
An incoming data listener thread that listens for incoming data on the current connection. All data is passed off to a queuing thread to be queued up for processing. Your listener threads should do as little as possible outside of basic listening and passing the data off for processing.
A queuing thread that queues up the data in the right order so everything can be processed correctly. This thread raises the semaphore to the processing queue to let it know there's data to be processed. Having this thread separate from the incoming data listener means that you're less likely to miss incoming data.
Some session object which is passed between methods so that each user's session is self contained throughout the threading model.
This keeps threads down to as simple but as robust a model as I've figured out. I would love to find a simpler model than this, but I've found that if I try to reduce the threading model any further, I start missing data on the network stream or missing connection requests.
It also assists with TDD (Test Driven Development) such that each thread is processing a single task and is much easier to code tests for. Having hundreds of threads can quickly become a resource allocation nightmare, while having a single thread becomes a maintenance nightmare.
It's far simpler to keep one thread per logical task the same way you would have one method per task in a TDD environment and you can logically separate what each should be doing. It's easier to spot potential problems and far easier to fix them.
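The queue-plus-worker part of this model can be sketched in Python, where queue.Queue bundles the locking and the "there is work" signalling that the description above handles with a sync lock and a semaphore. This is not the .NET code the answer refers to. The names, the session tuples and the SHUTDOWN sentinel are invented for the example, and the listener and queuing threads are replaced by a simple loop.

```python
import queue
import threading

SHUTDOWN = object()  # sentinel telling the worker to stop


def worker(inbox: queue.Queue, processed: list):
    """Processes one item per loop iteration; blocks while the queue is empty."""
    while True:
        item = inbox.get()  # sleeps here instead of looping continuously
        if item is SHUTDOWN:
            break
        session_id, payload = item
        processed.append((session_id, payload.strip()))


def run_pipeline(incoming):
    inbox = queue.Queue()
    processed = []
    worker_thread = threading.Thread(target=worker, args=(inbox, processed))
    worker_thread.start()  # the worker starts before any data arrives
    for item in incoming:  # stands in for the listener and queuing threads
        inbox.put(item)
    inbox.put(SHUTDOWN)
    worker_thread.join()
    return processed


output = run_pipeline([(1, " login "), (2, "move\n"), (1, "quit")])
```

Items come out in the order they were queued, each tagged with its session, and the worker consumes no CPU while the queue is empty.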
What's your platform? If Windows then I'd suggest looking at async operations and thread pools (or I/O Completion Ports directly if you're working at the Win32 API level in C/C++).
The idea is that you have a small number of threads that deal with your I/O and this makes your system capable of scaling to large numbers of concurrent connections because there's no relationship between the number of connections and the number of threads used by the process that is serving them. As expected, .Net insulates you from the details and Win32 doesn't.
The challenge of using async I/O and this style of server is that the processing of client requests becomes a state machine on the server and the data arriving triggers changes of state. Sometimes this takes some getting used to but once you do it's really rather marvellous;)
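To show what "a state machine driven by arriving data" can look like, here is a small sketch using Python's asyncio rather than IOCP or .NET. The protocol (first line is a login, "quit" ends the session) and all names are invented for the illustration. A single thread serves every connection, and each connection's state is advanced only when a line arrives.

```python
import asyncio


async def handle_client(reader, writer):
    # Per-connection state; each arriving line triggers a state change or a reply.
    state = "AWAIT_LOGIN"
    while (line := await reader.readline()):
        text = line.decode().strip()
        if state == "AWAIT_LOGIN":
            state = "READY"
            reply = f"welcome {text}"
        elif text == "quit":
            writer.write(b"bye\n")
            await writer.drain()
            break
        else:
            reply = f"echo {text}"
        writer.write(reply.encode() + b"\n")
        await writer.drain()
    writer.close()
    await writer.wait_closed()


async def demo(n_clients=50):
    # Port 0 lets the OS pick a free port for this local demonstration.
    server = await asyncio.start_server(handle_client, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    async def client(i):
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(f"user{i}\nhello\nquit\n".encode())
        await writer.drain()
        replies = [(await reader.readline()).decode().strip() for _ in range(3)]
        writer.close()
        await writer.wait_closed()
        return replies

    async with server:
        return await asyncio.gather(*(client(i) for i in range(n_clients)))
```

Running asyncio.run(demo(50)) serves 50 concurrent connections without creating a thread per client, which is the decoupling of connection count from thread count described above.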
I've got some free code that demonstrates various server designs in C++ using IOCP here.
If you're using Unix or need to be cross-platform and you're in C++, then you might want to look at Boost.Asio, which provides async I/O functionality.
I think the question you should be asking is not if 200 as a general thread number is good or bad, but rather how many of those threads are going to be active.
If only several of them are active at any given moment, while all the others are sleeping or waiting or whatnot, then you're fine. Sleeping threads, in this context, cost you nothing.
However if all of those 200 threads are active, you're going to have your CPU wasting so much time doing thread context switches between all those ~200 threads.