I have the following strange situation.
We have a process, call it Distributor, that receives tasks over ZeroMQ/TCP from a Client and accumulates them in a queue. There is a Worker process, which talks to the Distributor over ZeroMQ/IPC. The Distributor forwards each incoming task to the Worker and waits for an answer. As soon as the Worker answers, the Distributor sends it another task (if one was received in the meantime) and returns the answer to the Client (over a separate ZeroMQ/TCP connection). If a task is not processed within 10ms, it is dropped from the queue.
With 1 Worker, the system can process ~3,500 requests/sec. The client sends 10,000 requests/sec, so 6,500 requests are dropped.
But when I run some unrelated process on the server that takes 100% CPU (a busy-wait loop, or whatever), the system can suddenly process ~7,000 requests/sec. When that process is stopped, throughput drops back to ~3,500. The server has 4 cores.
The same happens when running 2, 3 or 4 Workers (connected to the same Distributor), with slightly different numbers.
The Distributor is written in C++. The Worker is written in Python and uses the pyzmq binding. The Worker does simple arithmetic and does not depend on any external I/O other than the Distributor.
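For reference, the Worker's loop is essentially the following (a minimal sketch; the endpoint name, the REP socket type, and the compute() helper are placeholders, not the actual code):

    import zmq

    def compute(task):
        return task  # placeholder for the real arithmetic

    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.connect("ipc:///tmp/distributor")  # endpoint name is made up

    while True:
        task = sock.recv()      # block until the Distributor forwards a task
        result = compute(task)  # pure arithmetic, no external I/O
        sock.send(result)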
There is a theory that this has to do with ZeroMQ using threads on separate CPUs when the server is free, and the same CPU when it's busy. If that is the case, I would appreciate an idea of how to configure the thread/CPU affinity of ZeroMQ so that it works correctly (without running a busy loop in the background).
Is there any ZeroMQ setting that might explain / fix this?
EDIT:
This doesn't happen with a Worker written in C++.
This was indeed a CPU affinity problem. It turns out that when ZeroMQ is used in a setting where a worker processes an input and then waits for the next one, a context switch that moves it to another processor wastes a lot of time copying the ZeroMQ data.
Running the worker with

    taskset -c 1 python worker.py

solves the problem.
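Alternatively, the process can pin itself at startup instead of relying on taskset; a minimal sketch, assuming Linux and Python 3.3+ (core 1 is just an example):

    import os

    # Pin this worker to CPU core 1, equivalent to launching it under
    # "taskset -c 1". The argument 0 means "the current process".
    os.sched_setaffinity(0, {1})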
I am trying to understand what happens to a thread when it is waiting for an HTTP response from a remote server.
Let's say that at one point in time n threads are running. The OS, based on its thread scheduling algorithm, will try to run every thread (say, in round-robin fashion). Suppose one of the n threads has initiated an HTTP request and is waiting for the response from the remote server. Will this thread keep getting its turn on a CPU core? Or is there some interrupt-like mechanism that will notify the thread when it is ready to run? If such a mechanism exists, then what is the benefit of asynchronous programming, at least from a CPU-utilization perspective?
Is this language dependent? If yes, what is the difference between Java vs Node.js vs Python ...?
I am trying to understand what happens to a thread when it is waiting for an HTTP response from a remote server.
Well, the thread will wait for the underlying TCP socket to receive data. HTTP is a high-level protocol that runs over a (blocking or non-blocking) TCP connection, so the thread doesn't wait for an "HTTP response" as such, but rather for some data to become available for the socket to read.
Will this thread keep getting its turn on a CPU core?
If the thread is waiting for a TCP socket to become readable, the OS doesn't schedule it to run until some data is received; only then will the OS schedule the thread to run at some point in the future. A blocked thread is never scheduled to run: the OS sees no reason to do so, considering that the thread has nothing to do.
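To illustrate, here is a minimal sketch using a plain blocking socket (the host is just an example):

    import socket

    # While blocked inside recv(), this thread consumes no CPU: the kernel
    # parks it until the socket becomes readable, then wakes it up again.
    conn = socket.create_connection(("example.com", 80))
    conn.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
    data = conn.recv(4096)  # the thread sleeps inside this call
    conn.close()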
Is this language dependent? If yes, what is the difference between Java vs Node.js vs Python ...?
No. Each OS provides a C/C++ API for applications to consume. Windows provides Win32, while Linux provides POSIX. Every programming language wraps and binds these APIs, and every "high level" call (such as connecting a socket) eventually calls the operating system APIs.
My understanding is that the asynchronous keyword lets your program continue executing instead of waiting for the forked task to complete. Even on single-core processors, as was the case with early computers, we were able to multitask, from which we can deduce that the OS allocates CPU time as judiciously as it can. So using async allows your thread of execution to keep going without waiting for the blocking task to complete; otherwise, even though the CPU will take turns executing threads, a single-threaded program will simply block.
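A minimal sketch of that idea using Python's asyncio, where the sleeps stand in for network waits:

    import asyncio

    async def fetch(name, delay):
        # asyncio.sleep() stands in for a network wait; "await" hands
        # control back so the single thread can run other tasks meanwhile.
        await asyncio.sleep(delay)
        return name

    async def main():
        # Both "requests" overlap on one thread: total time is ~2s, not ~3s.
        print(await asyncio.gather(fetch("a", 2), fetch("b", 1)))

    asyncio.run(main())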
I have a Foxx application developed and running on machine A. Its CPU utilization is usually below 3-4% and sometimes spikes to 20%. I have close to 6 million records.
The same application is deployed on another machine (an exact replica of machine A) with only about 100k records, but CPU utilization there is around 200%.
How do I debug this? What is happening on machine B? Both machines have the same application, the same ArangoDB version, and the same configuration. Disk I/O is also the same, and memory utilization on machine B is 1/6th of machine A's.
Any pointers? This is happening in a production environment, so it's really important for me to debug it quickly.
We were finally able to reproduce such an issue ourselves. We found there was a situation in which a scheduler thread could go into a busy-wait state, resulting in the following loop being executed over and over:
1. a scheduler thread calls epoll_wait()
2. epoll_wait() returns instantly, signalling an event for a certain file descriptor
3. the correct event-handling callback is called, but it does not remove the file descriptor from the list of watched descriptors
4. goto 1
Because the one file descriptor was never cleared from the list of watched descriptors, epoll_wait() always signalled an event for it. This made it return almost instantly, so the whole loop above executed many times per second.
This caused CPU spikes in threads named scheduler.
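A minimal Python sketch of the same pattern, using level-triggered epoll (the socket pair just stands in for the real descriptor):

    import select
    import socket

    # A readable fd that the handler never drains or unregisters keeps
    # level-triggered epoll returning instantly, spinning the loop.
    r, w = socket.socketpair()
    w.send(b"x")  # make r readable once

    ep = select.epoll()
    ep.register(r.fileno(), select.EPOLLIN)

    for i in range(5):        # in the real bug, this spun indefinitely
        events = ep.poll()    # returns immediately on every iteration
        print(i, events)      # the same fd is reported again and again
        # The fix: read the pending data, or ep.unregister(r.fileno()).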
We found one cause to be a client-side connection timing out while the operation triggered by the connection was still executing on the server side. For example, if a client called a server route that took 5 seconds to complete and respond, but the client disconnected after 3 seconds, this might have happened.
What made it hard to reproduce is that it did not affect all such client connections, but only some; which ones is still unclear.
This particular issue was fixed in ArangoDB 2.6.5, so you may want to give it a try when it is released.
I know that Node is a single-threaded system, and I was wondering whether a child process uses its own thread or its parent's. Say, for example, I have an AMD E-350 CPU with two threads. If I ran a Node server that spawned ten child instances, all working continuously, would that be allowed, or would it fail because the hardware itself is not sufficient?
I can say from own experience that I successfully spawned 150 child processes inside an Amazon t2.micro with just one core.
The reason? I was DoS-ing myself for testing my core server's limits.
The attack stayed alive for 8 hours, until I gave up, but it could've been working for much longer.
My code was simply running an HTTP client pool, and as soon as one request was done, another one was spawned. This doesn't need a lot of CPU. It needs lots of network, though.
Most of the time, the processes were just waiting for requests to finish.
However, in a high-concurrency application, the performance will be awful if you share memory between so many processes.
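To make the point concrete, here is a minimal sketch (in Python rather than Node, purely for illustration) of many processes that are cheap because they spend almost all their time waiting:

    import multiprocessing
    import time

    def client(i):
        time.sleep(1.0)  # stands in for an HTTP request: network wait, not CPU
        return i

    if __name__ == "__main__":
        # Far more processes than cores is fine here, because each process
        # is blocked in a wait almost all of the time.
        procs = [multiprocessing.Process(target=client, args=(i,))
                 for i in range(150)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()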
Our system is implemented as an HTTP Native Code Module in C++ for IIS/W2008 R2 64-bit.
Scenario:
Request 1 arrives at the web server; IIS starts a new thread (t1). The thread is executing.
Request 2 arrives before request 1 is finished. IIS starts a new thread (t2). This thread goes into a loop waiting for a shared resource to become available; this behavior is programmed by us. Since thread t2 spins in a loop, it starts consuming 100% of the CPU.
Problem: the operating system does not perform round-robin between the two threads, and the system just hangs. If it switched execution to the first thread, the shared resource would be released and the second thread could run as well.
Even stranger: this behavior only occurs when the machine has 1 CPU. If we add another CPU to the machine, it works perfectly, switching between the two threads as expected. Nothing hangs.
A workaround (and better programming, too) that makes it work with only 1 CPU is to put a sleep(100) in the loop that checks for availability of the shared resource.
Why doesn't the operating system perform round-robin between the two threads when it has only 1 CPU? Is it related to VMware?
This thread goes into a loop waiting for a shared resource to be available
This sounds like the wrong way to synchronize things. You need signaling between the threads, so that the OS gets a hint to perform the context switch.
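For example, a minimal sketch of the signaling idea in Python (the real fix would use the equivalent Win32/C++ primitives, such as an event or condition variable):

    import threading

    resource_ready = threading.Event()

    def t1_work():
        # ... produce and release the shared resource ...
        resource_ready.set()   # signal: wakes any thread blocked in wait()

    def t2_work():
        # Instead of "while not ready: pass" (a spin that can starve t1 on
        # a single core), block here; the OS deschedules this thread.
        resource_ready.wait()
        # ... use the shared resource ...

    threading.Thread(target=t2_work).start()
    threading.Thread(target=t1_work).start()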
We are monitoring Tomcat using an SNMP tool, and it's showing me:
Thread Total Started Count = 500 (It's changing frequently)
I hunted down the OID and found it is "jvmThreadTotalStartedCount": http://support.ipmonitor.com/mibs/JVM-MANAGEMENT-MIB/item.aspx?id=jvmThreadTotalStartedCount
Its description says: "The total number of threads created and started since the Java Virtual Machine started."
My question is: what does this mean? Could someone explain it to me in simple, basic language?
A thread is a flow of execution within a process. There are processes that only have a single flow of execution (single-threaded) and others, like Tomcat, which partition their behavior into several flows of execution, in parallel (multi-threaded).
Tomcat, as a web server, typically allocates one thread to handle each request it receives, up to a limit (which might be 500 in your case), after which subsequent requests are queued, waiting for a thread to become free to handle them. This is known as thread pooling.
So, to answer your first question, Thread Total Started Count is the total count of all the different flows of execution that have been created by this instance of Tomcat since it started running.
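To illustrate the pooling idea, a minimal sketch in Python (Tomcat's pool is Java, but the mechanism is the same; the sizes here are arbitrary):

    from concurrent.futures import ThreadPoolExecutor
    import time

    def handle_request(n):
        time.sleep(0.1)  # stands in for real request handling
        return n

    # A pool capped at 4 worker threads: the 10 submitted "requests" queue
    # up and are handled as workers become free, like Tomcat's request pool.
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(handle_request, range(10))))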