My requirements state that the system should be able to process 10000 requests in 1 hour. That means I should run a test that measures how fast it can process 10k requests (not concurrent users). I have never had to do this before, as my requirements were always expressed in concurrent users. Can someone tell me which thread group to use and how it should be set up for this particular task? Again, the goal is not to reach 10k concurrent users but to run 10k requests and see how fast they complete.
You need to finish 10000 requests in 3600 seconds.
Assuming that the last request takes at most 100 seconds, you should run a Thread Group with Number of Threads = 10000 and a ramp-up of at most 3500 seconds (3600 - 100).
Ramp-up needs to be long enough to avoid too large a work-load at the start of a test, and short enough that the last threads start running before the first ones finish
I suggest starting with Ramp up = 3500 and decreasing it until requests start to fail.
The ramp-up period tells JMeter how long to take to "ramp-up" to the full number of threads chosen.
Another option is the JMeter plugin Concurrency Thread Group:
A simplified approach for configuring the thread schedule. It is intended to maintain the level of concurrency, which means starting additional threads during the runtime if there are not enough of them running in parallel. Unlike the standard Thread Group, it won't create all the threads upfront, so extra memory won't be used. It's a good replacement for Stepping Thread Group, since it allows threads to finish their job gracefully.
In your case, with no concurrent users for an hour:
Target Concurrency = 1 (not concurrent)
Ramp up Time = 60 (minutes)
Ramp up Steps Count = 1
Hold Target Rate Time = 60 (minutes)
Thread Iteration Limit = 1 (not concurrent)
10 000 requests in one hour is about 166.7 requests per minute, or about 2.8 requests per second.
You can use Constant Throughput Timer or Throughput Shaping Timer in order to limit JMeter's throughput to the given numbers and see whether it will be able to complete 10k requests in 1 hour or not.
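The arithmetic behind those targets can be sketched as follows. This is a minimal illustration, not part of JMeter: the function name throughput_targets and its dictionary keys are made up for this example. The per-minute figure is the one that matters for Constant Throughput Timer, which takes its target in samples per minute.

```python
def throughput_targets(total_requests: int, window_seconds: float) -> dict:
    """Convert "N requests in T seconds" into rates a load tool can be configured with."""
    per_second = total_requests / window_seconds
    return {
        "per_second": per_second,            # average request rate
        "per_minute": per_second * 60,       # unit used by Constant Throughput Timer
        "pacing_seconds": 1 / per_second,    # average gap between request starts
    }


targets = throughput_targets(10_000, 3_600)
print(f"{targets['per_second']:.2f} requests/second")
print(f"{targets['per_minute']:.2f} requests/minute")
print(f"one request every {targets['pacing_seconds']:.2f} s")
```

If the measured throughput stays at or above the per-second figure for the whole run, the 10k requests fit into the hour.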
The test run may be shorter: if you see the throughput metric dropping below the target rate, most likely the system under test will not be able to cope with the load.
Once you are confident that your system can handle 10k requests in one hour, you can increase the throughput to see if/where/when it breaks.
I am confused by the concept of millicores in Kubernetes. As far as I know from programming, only 1 thread can run per core, so why would you give a limit in millicores?
For example, if I give a CPU limit of 600m to a container, can I use the remaining 400m for another pod or container?
I have installed minikube and tried it there.
Will both containers or pods run different threads? Please explain if you can.
It's best to see millicores as a way to express fractions, x millicores correspond to the fraction x/1000 (e.g. 250millicores = 250/1000 = 1/4).
The value 1 represent the complete usage of 1 core (or hardware thread if hyperthreading or any other SMT is enabled).
So 100mcpu means the process is using 1/10th of a single CPU time. This means that it is using 1 second out of 10, or 100ms out of a second or 10us out of 100.
Just take any unit of time, divide it into ten parts, the process is running only for one of them.
Of course, if you take too short an interval (say, 1us), the overhead of the scheduler becomes non-negligible, but that's not important.
If the value is above 1, then the process is using more than one CPU. A value of 2300mcpu means that out of, say, 10 seconds, the process is running for... 23!
This is used to mean that the process is using 2 whole CPUs and 3/10 of a third one.
This may sound weird but it's no different to saying: "I work out 3.5 times a week" to mean that "I work out 7 days every 2 weeks".
Remember: millicores represent a fraction of CPU time not of CPU number. So 2300mcpu is 230% the time of a single CPU.
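As a small sketch of that rule, the conversion from millicores to CPU time is a single multiplication. The function name cpu_time_ms is made up for this example; the numbers are the ones used above.

```python
def cpu_time_ms(millicores: int, interval_ms: float) -> float:
    """CPU time (in ms) that `millicores` corresponds to over `interval_ms` of wall-clock time."""
    return millicores * interval_ms / 1000


print(cpu_time_ms(100, 1_000))     # 100mcpu over 1 second
print(cpu_time_ms(2300, 10_000))   # 2300mcpu over 10 seconds
print(cpu_time_ms(250, 60_000))    # 250mcpu over 1 minute
```

The second call returns more CPU time than wall-clock time (23 seconds out of 10), which is only possible when the work is spread over more than one CPU.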
What I hate about technologies like Kubernetes and Docker is that they hide too much, confusing seasoned programmers.
The millicores unit arises, at its base, from the way the Linux scheduler works. It doesn't divide the time into quanta and assign each thread the CPU for a quantum; instead, it runs a thread until it's unfair to keep it running. So a thread can run for a variable time.
The current Linux scheduler, named CFS, works with the concept of waiting time.
Each thread has a waiting time, a counter that is incremented each nanosecond (but any sufficiently fine unit of time will do) that the thread is waiting to execute and that is decremented each nanosecond the thread is executing.
The threads are then ordered by their wait time divided by the total number of threads; the thread with the greatest wait time is picked and run until its wait time (which is now decreasing) falls below the wait time of another thread (which will then be scheduled).
So if we have one core (without HyperThreading or any other SMT) and four threads, after, say, a second, the scheduler will have allocated 1/4 of that second (250ms) to each thread.
You can say that each thread used 250millicores. This means it uses 250/1000 = 1/4 of the core time on average. The "core time" can be any amount of time, granted it is far greater than the scheduler wallclock. So 250millicores means 1 minute of time every 4, or 2 days every 8.
When a system has multiple CPUs/cores, the waiting time is scaled to account for that.
Now if a thread is scheduled, over the course of 1 second, to two CPUs for the whole second, we have a usage of 1/1 for the first CPU and 1/1 for the second one. A total of 1/1 + 1/1 = 2 or 2000mcpu.
This way of counting CPU time, albeit weird at first, has the advantage that it is absolute. 100mcpu means 1/10 of a CPU, no matter how many CPUs there are; this is by design.
If we counted time in a relative manner (i.e. where the value 1 means all the CPUs) then a value like 0.5 would mean 24 CPUs in a 48-CPU system and 4 in an 8-CPU system.
It would be hard to compare timings.
The Linux scheduler doesn't actually know about millicores, as we have seen it uses the waiting time and doesn't need any other measurement unit.
That millicores unit is just a unit we make up, so far, for our convenience.
However, it will turn out this unit will arise naturally due to how containers are constrained.
As implied by its name, the Linux scheduler is fair: all threads are equal. But you don't always want that; a process in a container should not hog all the cores on a machine.
This is where cgroups come into play. Cgroups are a kernel feature that is used, along with namespaces and a union fs, to implement containers.
Its main goal is to restrict processes, including their CPU bandwidth.
This is done with two parameters, a period and a quota.
The restricted thread is allowed, by the scheduler, to run for quota microseconds (us) every period us.
Here, again, a quota greater than the period means using more than one CPU. Quoting the kernel documentation:
Limit a group to 1 CPU worth of runtime.
If period is 250ms and quota is also 250ms, the group will get
1 CPU worth of runtime every 250ms.
Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
runtime every 500ms.
We see how, given x millicores, we can compute the quota and the period.
We can fix the period at 100ms and set the quota to (100 * x) / 1000 ms.
This is how Docker does it.
Of course, we have an infinite choice of pairs: we set the period to 100ms, but we could use any value (actually, the number of possible values is finite, but still).
Larger values of the period mean the thread can run for a longer time but will also pause for a longer time.
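The period/quota computation can be sketched like this. It is an illustration only: the function name cfs_quota_us and the constant PERIOD_US are made up for this example, and the 100ms default period is the value assumed in the text above. Values are in microseconds because that is the unit the scheduler uses for both parameters.

```python
PERIOD_US = 100_000  # 100 ms period, the value assumed in the text above


def cfs_quota_us(millicores: int, period_us: int = PERIOD_US) -> int:
    """Quota (in microseconds per period) that corresponds to a millicore limit."""
    return millicores * period_us // 1000


print(cfs_quota_us(600))            # 600m with a 100 ms period
print(cfs_quota_us(2300))           # more than one CPU: quota exceeds the period
print(cfs_quota_us(600, 250_000))   # same 600m limit with a 250 ms period
```

The first and third calls describe the same 600m limit: the ratio quota/period is 0.6 in both cases, only the length of the run and pause phases changes.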
Here is where Docker is hiding things from the programmer, using an arbitrary value for the period in order to compute the quota (given the millicores, which the authors dub as more "user-friendly").
Kubernetes is designed around Docker (yes, it can use other container managers, but they must expose an interface similar to Docker's), and the Kubernetes millicores unit matches the unit used by Docker in its --cpus parameter.
So, long story short, millicores are the fractions of time of a single CPU (not the fraction of number of CPUs).
Cgroups, and hence Docker and Kubernetes, don't restrict CPU usage by assigning cores to processes (as VMs do); instead they restrict the amount of time (quota over period) the process can run on each CPU (with each CPU contributing up to 1000mcpu worth of allowed time).
The scheduler of the kernel running the containers (e.g. Linux) has the means to reserve time slices for a process so that it runs concurrently with other processes on the same CPU.
You can throttle a process, giving it fewer time slices, if it uses too much CPU. This happens when a (hard) limit is hit. A pod can be scheduled to a different node if its CPU requests exceed the CPU resources available on a node.
So requests are a hint that tells the Kubernetes scheduler how to place pods optimally across nodes, and limits are enforced by the kernel scheduler to ensure that no more resources are actually used.
Actually if you just configure requests and no limits, all pods will be scheduled by the kernel scheduler policy, which is trying to be fair and balance the resources across all processes to maximize the usage while not starving any single process.
I've got a program that has about 80 threads. It's running on a ~50ish core machine on linux 3.36. At most there are 2 of these programs running at once, and they are identical. Nothing else is running on the machine.
The threads themselves are real-time linux pthreads with SCHED_RR (round robin) policy.
10 are highest priority (yes, I set ulimit to 99) and have cpu affinity set to 10 of the cores. In other words, they are each pinned to their own core.
about 60 are medium priority.
about 10 are low priority.
The 10 highest priority threads are constantly using cpu.
The rest are doing network IO as well as doing some work on the CPU. Here's the problem: I'm seeing one of the low priority threads being starved, sometimes over 15 seconds at a time. This specific thread is waiting on a TCP socket for some data. I know the data has been fully sent because I can see that the server on the other end of the connection has sent the data (i.e., it logs a timestamp after sending the data). Usually the thread takes milliseconds to receive and process it, but sporadically it will take 15 seconds after the other server has successfully sent the data. Note that increasing the priority of the thread and pinning it to a CPU has eradicated this issue, but this is not a long-term solution. I would not expect this behavior in the first place - 15 seconds is a very long time.
Does anyone know why this would be happening? We have ruled out that it is any of the logic in the program/threads. Also note that the program is written in C.
I would not expect this behavior in the first place - 15 seconds is a very long time.
If your 60 medium-priority threads were all runnable, then that's exactly what you'd expect: with realtime scheduling, lower-priority threads won't run at all while higher-priority threads are still runnable.
You might be able to use perf timechart to analyse exactly what's going on.
a) I have a task which I want the server to do every X hours for every user (~5000 users). Is it better to:
1 - Create a worker thread for each user that does the task and sleep for X hours then start again, where each task is running in random time (so that most tasks are sleeping at every moment)
2 - Create one thread that loops through the users and does the task for each user, then starts again (even if this takes more than X hours).
b) If plan 1 is used, do sleeping threads affect the performance of the server?
c) If the answer is yes, does a sleeping thread have the same effect as a thread that is doing the task?
Note that this server is not only used for this task. It is used for all the communications with the ~5000 clients.
Sleeping threads generally do not affect CPU usage, but each one typically reserves 1MB of stack memory. This is not a big deal for dozens of threads. It is a big deal for 5000 threads, which would reserve about 5GB.
Have one thread or timer dedicated to triggering the hourly work. Once per hour you can process the users. You can use parallelism if you want. Process the users using Parallel.ForEach or any other technique you like.
Whether you choose a thread or a timer doesn't matter for CPU usage in any meaningful way. Do whatever fits your app best.
There are not enough details about your issue for a complete answer. However, based on the information you provided, I would:
create a timer (threading.timer)
set an interval which will be the time interval between processing a "batch" of 5'000 users
Then say the method/task you want to perform is called UpdateUsers:
when timer "ticks", in the UpdateUsers method (callback):
1. stop timer
2. loop and perform task for each user
3. start timer
This way you ensure that the task is performed for each user and that batches never overlap, even if one takes more than X hours in total. The next batch starts Y after the previous one finishes, where Y is the interval you set for your timer. Also, this uses at most one thread, depending on how your server/service is coded.
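The stop / process / restart pattern described above could look roughly like the sketch below. It uses Python's threading.Timer purely for illustration, which is an assumption: the answer does not say which language the server is written in. The class name BatchScheduler and its methods are made up for this example. A threading.Timer fires only once, so "stop timer" is implicit, and re-arming it after the loop is what prevents overlapping batches.

```python
import threading


class BatchScheduler:
    """Runs `task` for every user, then waits `interval_seconds` before the next batch."""

    def __init__(self, interval_seconds, users, task):
        self.interval_seconds = interval_seconds
        self.users = users
        self.task = task
        self._timer = None
        self._stopped = threading.Event()

    def start(self):
        self._arm()

    def stop(self):
        self._stopped.set()
        if self._timer is not None:
            self._timer.cancel()

    def _arm(self):
        # Schedule the next batch, unless stop() was called in the meantime.
        if self._stopped.is_set():
            return
        self._timer = threading.Timer(self.interval_seconds, self.run_batch)
        self._timer.daemon = True  # do not keep the process alive just for the timer
        self._timer.start()

    def run_batch(self):
        # 1. the one-shot timer has already fired, so nothing else can tick now
        # 2. loop and perform the task for each user
        for user in self.users:
            self.task(user)
        # 3. start the timer again only after the whole batch has finished
        self._arm()


processed = []
scheduler = BatchScheduler(3600, ["alice", "bob"], processed.append)
scheduler.run_batch()  # run one batch immediately; this also arms the next one
scheduler.stop()
print(processed)
```

In a real service you would call start() once at startup and stop() at shutdown; run_batch() is called directly here only to keep the example short.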
I am trying to run some experiments where I need to run benchmarks under heavy load. Starting with CPU load, I schedule a sysbench daemon that generates 1000 primes. I set its priority to low so that it only runs when the CPU is not busy with other tasks, to reduce its impact on the regular workload. Since the priority of the process is set to low, the process keeps waiting in the queue until it finds a free CPU core to run on. The problem is that its result shows the execution time including the wait period (in the queue), which renders the result invalid.
Is there some way that I could actually calculate the wait period and subtract it from the result to get a valid result?
Given a machine with 1 CPU and a lot of RAM. Besides other kinds of applications (web server etc.), there are 2 other server applications running on that machine doing the exact same kind of processing, although one uses 10 threads and the other uses 1 thread. Assume the processing logic for each request is 100% CPU-bound and typically takes no longer than 2 seconds to finish. The question is whose throughput, in terms of transactions processed per minute, might be better? Why?
Note that the above is not a real environment, I just make up the data to make the question clear. My current thinking is that there should be no difference because the apps are 100% CPU-bound and therefore if the machine can handle 30 requests per minute for the 2nd app, it will also be able to handle 3 requests per minute for each of the 10 threads of the 1st app. But I'm glad to be proven wrong, given the fact that there are other applications running in the machine and one application might not be always given 100% CPU time.
There's always some overhead involved in task switching, so if the threads aren't blocking on anything, fewer threads is generally better. Also, if the threads aren't executing the same part of the code, you'll get some cache flushing each time you switch.
On the other hand, the difference might not be measurable.
Interesting question.
I wrote a sample program that does just this. It has a class that will go do some processor intensive work, then return. I specify the total number of threads I want to run, and the total number of times I want the work to run. The program will then equally divide the work between all the threads (if there's only one thread, it just gets it all) and start them all up.
I ran this on a single-processor VM since I couldn't find a real computer with only 1 processor in it anymore.
Run independently:
1 Thread 5000 Work Units - 50.4365sec
10 Threads 5000 Work Units - 49.7762sec
This seems to show that on a single-processor PC, with lots of threads doing processor-intensive work, Windows is smart enough not to rapidly switch them back and forth, and they take about the same amount of time.
Run together (or as close as I could get to pushing enter at the same time):
1 Thread 5000 Work Units - 99.5112sec
10 Threads 5000 Work Units - 56.8777sec
This is the meat of the question. When you run 10 threads + 1 thread, they all seem to be scheduled equally. The 10 threads each took 1/10th longer (because there was an 11th thread running) while the other thread took almost twice its time (really, it got 1/10th of its work done in the first 56sec, then did the other 9/10ths in the next 43sec...which is about right).
The result: the Windows scheduler is fair at the thread level, but not at the process level. If you make a lot of threads, you can leave the other processes that weren't smart enough to make lots of threads high and dry. Or just do it right and use a thread pool :-)
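The "run together" numbers can be reproduced with an idealized model. The sketch below is not the test program linked below; it assumes a single CPU shared equally among all runnable threads, no switching overhead, and about 50 seconds of CPU work per process (rounded from the independent runs above). The function name finish_times is made up for this example.

```python
def finish_times(work_seconds):
    """Completion time of each thread when one CPU is shared equally among runnable threads."""
    order = sorted(range(len(work_seconds)), key=lambda i: work_seconds[i])
    finished = [0.0] * len(work_seconds)
    clock = 0.0
    done_per_thread = 0.0  # CPU time every still-running thread has received so far
    for position, index in enumerate(order):
        still_running = len(order) - position
        # Each running thread advances at 1/still_running of real time,
        # so the remaining work of the shortest thread takes this long:
        clock += (work_seconds[index] - done_per_thread) * still_running
        done_per_thread = work_seconds[index]
        finished[index] = clock
    return finished


# One thread with 50 s of work, plus ten threads with 5 s of work each.
times = finish_times([50.0] + [5.0] * 10)
print(times[0])    # the single-threaded process
print(times[1])    # any thread of the 10-thread process
```

The model gives 55 seconds for the 10-thread process and 100 seconds for the single thread, close to the measured 56.9 and 99.5 seconds, which supports the per-thread fairness explanation.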
If you're interested in trying it for yourself, you can find my code:
http://teeks99.com/ThreadWorkTest.zip
The scheduling overhead could make the app with 10 threads slower than the one with 1 thread. You won't know for sure unless you create a test.
For some background on multithreading see http://en.wikipedia.org/wiki/Thread_(computer_science)
This might very well depend on the operating system scheduler. For example, back in single-thread days the scheduler knew only about processes, and had measures like "niceness" to figure out how much to allocate.
In multithreaded code, there is probably a way in which one process that has 100 threads doesn't get 99% of the CPU time if there's another process that has a single thread. On the other hand, if you have only two processes and one of them is multithreaded I would suspect that the OS may give it more overall time. However, AFAIK nothing is really guaranteed.
Switching costs between threads in the same process may be cheaper than switching between processes (e.g., due to cache behavior).
One thing you must consider is wait time on the other end of the transaction. Having multiple threads will allow you to be waiting for a response on one while preparing the next transaction on the next. At least that's how I understand it. So I think a few threads will turn out better than one.
On the other hand, you must consider the overhead involved in dealing with multiple threads. The details of the application are an important part of the consideration here.