On Linux, is it possible to record the running processes (just which process is running when) for some period of time? It would be like getting a log from top. The reason I want to do this is that I have performance issues with my process, and the box I am working on does not provide any facility to analyze which processes are running and when. More specifically, my process has a response time of anywhere between 3.2s and 3.9s, but this is not random: there are periods of time at 3.2s and periods at 3.9s. Having ruled out blocking I/O, we want to see whether some other process is running during the 3.9s periods.
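One way to capture such a log is to sample the process table at a fixed interval and append the busy processes to a file. A minimal sketch, assuming Python 3 and the third-party psutil package are available on the box (both assumptions; the interval and CPU filter are arbitrary):

    # Periodically snapshot which processes are using CPU, like a log of `top`.
    import time
    import psutil  # third-party: pip install psutil

    INTERVAL_S = 0.5    # sampling period
    MIN_CPU_PCT = 1.0   # only log processes above this CPU usage

    # Prime the per-process CPU counters (the first call always returns 0.0).
    for p in psutil.process_iter():
        try:
            p.cpu_percent(None)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass

    with open("proc_log.txt", "w") as log:
        while True:
            time.sleep(INTERVAL_S)
            stamp = time.strftime("%H:%M:%S")
            for p in psutil.process_iter(attrs=["pid", "name"]):
                try:
                    cpu = p.cpu_percent(None)
                except (psutil.NoSuchProcess, psutil.AccessDenied):
                    continue
                if cpu >= MIN_CPU_PCT:
                    log.write(f"{stamp} pid={p.info['pid']} {p.info['name']} cpu={cpu:.1f}%\n")
            log.flush()

Correlating the timestamps in such a log with the 3.2s/3.9s periods should show whether another process wakes up during the slow periods.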
Related
I have an embedded system with multiple user processes that run simultaneously; they are interdependent and communicate via POSIX queues. The issue is that one of the processes takes a bit more time to complete its task (I don't know which process, or which section of code), and because of this the other processes are delayed in completing theirs.
How can I figure out which process is taking more time, and in which section of code? The system is a measuring device, so it cannot tolerate any delay or spikes in processing time. I tried changing the data rate of the entire system, but that does not help; the spikes still appear.
Is there any way in Linux to hook a system call (or trigger some action) when a process, scheduled in the same section of code, exceeds a certain threshold of scheduling duration?
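I don't know of a way to have the scheduler call back into the process directly, but a common first step is to instrument each section with a monotonic clock and log whenever it overruns a budget. An illustrative sketch in Python (the real processes are presumably C/C++, where the same pattern would use clock_gettime(CLOCK_MONOTONIC); the section name and threshold below are made up):

    # Time each section of work and warn when it exceeds a (hypothetical) budget.
    import time

    THRESHOLD_S = 0.010   # hypothetical 10 ms budget per section

    def timed_section(name, func, *args, **kwargs):
        start = time.monotonic()
        result = func(*args, **kwargs)
        elapsed = time.monotonic() - start
        if elapsed > THRESHOLD_S:
            print(f"WARNING: section '{name}' took {elapsed * 1000:.2f} ms "
                  f"(budget {THRESHOLD_S * 1000:.1f} ms)")
        return result

    # Dummy workload standing in for one processing stage.
    def process_samples(n):
        return sum(i * i for i in range(n))

    timed_section("filter_stage", process_samples, 200_000)

Logging like this in each process at least narrows down which process and which section overruns, and when.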
Suppose I have a multi-core laptop.
I write some code in Python and run it;
then, while my Python code is running, I open MATLAB and run some other code.
What is going on underneath? Will these two processes be run in parallel on multiple cores automatically?
Or does the computer wait for one to finish and then process the other?
Thank you!
P.S. The two programs I am referring to can be considered the simplest in nature, e.g. calculate 1+2+3.....+10000000
The answer is... it depends!
Your operating system is constantly switching which processes are running. There are tons of processes always running in the background - refreshing the screen, posting sound to the speakers, checking for updates, polling the mouse, etc. - and those processes can only actually execute if they get some amount of processor time. If you have many cores, the OS will use some sort of heuristics to figure out which processes should get some time on the cores. You have the illusion that everything is running at the same time because (1) in some sense, things are running at the same time because you have multiple cores, and (2) the switching happens so fast that you can't notice it happen.
The reason I'm bringing this up is that if you run both Python and MATLAB at the same time, while in principle they could easily run at the same time, it's not guaranteed that that happens because you may have a ton of other things going on as well. It might be that both Python and MATLAB run for a bit concurrently, then both temporarily get paused to allow some program that's playing music to load the next sound clip to be played into memory, then one pauses while the OS pages in some memory from disk and another takes over, etc.
Can you assume that they'll run in parallel? Sure! Most reasonable OSes will figure that out and do it correctly. Can you assume that they are running in parallel exclusively, with nothing else competing for the cores? Not necessarily.
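You can check this yourself with a small experiment: run two CPU-bound tasks one after the other, then in two separate processes, and compare wall-clock times. A minimal sketch in Python (the workload is just the 1+2+...+N sum from the question):

    # Compare sequential vs. parallel execution of two CPU-bound tasks.
    import time
    from multiprocessing import Process

    def busy_sum(n):
        # 1 + 2 + ... + n computed with an explicit loop to keep a core busy.
        total = 0
        for i in range(n):
            total += i + 1
        return total

    if __name__ == "__main__":
        N = 10_000_000

        start = time.perf_counter()
        busy_sum(N)
        busy_sum(N)
        print(f"sequential: {time.perf_counter() - start:.2f} s")

        start = time.perf_counter()
        procs = [Process(target=busy_sum, args=(N,)) for _ in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(f"parallel:   {time.perf_counter() - start:.2f} s")

On an otherwise idle multi-core machine the parallel run takes roughly half as long, which is the behaviour described above; on a loaded machine the gap shrinks.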
I have a Linux embedded system built with PREEMPT_RT (real time patch) that creates multiple SCHED_FIFO threads, with a priority of 90 each. The goal is that they execute without being preempted, but with the same priority.
Each thread does a little bit of work, then goes to sleep using std::this_thread::sleep_for() for a few milliseconds, then gets scheduled back and executes the same amount of work.
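For reference, each thread looks roughly like the following, with a cheap per-cycle check added to timestamp bad wake-ups. This is a sketch in Python purely as an analogy (the real threads are C++ using std::this_thread::sleep_for), and it assumes the process may request SCHED_FIFO itself (root or CAP_SYS_NICE):

    # Analogy of one periodic SCHED_FIFO thread that logs late wake-ups.
    import os
    import time

    PERIOD_S = 0.005           # hypothetical few-millisecond sleep per cycle
    OVERSHOOT_LIMIT_S = 0.050  # anything later than this is worth logging

    # Requires root or CAP_SYS_NICE.
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(90))

    while True:
        t0 = time.monotonic()
        # ... a little bit of work would go here ...
        time.sleep(PERIOD_S)
        overshoot = time.monotonic() - t0 - PERIOD_S
        if overshoot > OVERSHOOT_LIMIT_S:
            print(f"{time.strftime('%H:%M:%S')} woke up {overshoot * 1000:.1f} ms late")

Logging the wall-clock time of each late wake-up at least pins down exactly when the stalls described below happen, which helps when correlating them with other system activity.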
Most of the time, each thread's latency is impeccable, but once every minute or so (not at an exact regular interval) all threads stall at the same time for one second or more, instead of the few milliseconds after which they are usually scheduled back.
I have made sure power management is disabled in the kernel Kconfig, and I have called mlockall() to avoid memory being paged out, to no avail.
I have tried to use ftrace with wakeup_rt as the tracer, but the highest latency recorded was around 5ms, not nearly enough time to be the cause of the issue.
I am not sure what tool would be best to identify where the latency is coming from. Does anyone have ideas please?
I have a MATLAB processing script located in the middle of a long processing pipeline running on Linux.
The MATLAB script applies the same operation to a number N of datasets D_i (i=1,2,...,N) in parallel (on 8 cores) via parfor.
Usually, processing the whole dataset takes about 2 hours (on 8 cores).
Unfortunately, from time to time, it looks like one of the MATLAB subprocesses crashes randomly. This makes the job impossible to complete (and the pipeline can't finish).
I am sure this does not depend on the data: if I reprocess exactly the D_i on which the process crashed, it is executed without problems. Moreover, I have already processed thousands of these datasets.
How I deal with the problem now (...manually...):
After I start the MATLAB job, I periodically check the process list on the machine (via a simple top); whenever a MATLAB process is still alive after two hours of work, I know for sure that it has crashed. Then I simply kill it and reprocess the part of the dataset which has not been analyzed.
Question:
I am looking for suggestions on how to time out ALL the running MATLAB processes and kill any that have accumulated more than, e.g., 2 hours of CPU time.
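For completeness, the manual check could be automated with a small watchdog. A sketch assuming Python 3 and psutil are installed on the machine; the 2-hour CPU limit, the poll interval, and the name match on "matlab" are all assumptions to adapt:

    # Kill any MATLAB process whose accumulated CPU time exceeds a limit.
    import time
    import psutil  # third-party: pip install psutil

    CPU_LIMIT_S = 2 * 60 * 60   # 2 hours of CPU time
    CHECK_EVERY_S = 300         # poll every 5 minutes

    while True:
        for p in psutil.process_iter(attrs=["pid", "name"]):
            try:
                if "matlab" not in (p.info["name"] or "").lower():
                    continue
                cpu = p.cpu_times()
                if cpu.user + cpu.system > CPU_LIMIT_S:
                    print(f"killing pid {p.info['pid']} ({p.info['name']}), "
                          f"cpu={cpu.user + cpu.system:.0f} s")
                    p.kill()
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        time.sleep(CHECK_EVERY_S)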
You should be able to do this by restructuring your code to use PARFEVAL instead of PARFOR. There's a simple example in this entry on Loren's blog: http://blogs.mathworks.com/loren/2013/12/09/getting-data-from-a-web-api-in-parallel/ which shows how you can stop waiting for work after a given amount of time.
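The pattern the blog post demonstrates is essentially "submit each piece of work independently, then stop waiting after a deadline". Sketched here in Python with concurrent.futures purely as an analogy (the actual fix would use parfeval in MATLAB, as the answer says; the deadline and the dummy workload are made up):

    # Submit independent tasks, collect whatever finishes before a deadline,
    # and give up on the rest instead of blocking the whole job.
    import concurrent.futures as cf
    import time

    def process_dataset(i):
        time.sleep(i * 0.1)   # stand-in for the real per-dataset work
        return i * i

    if __name__ == "__main__":
        DEADLINE_S = 1.0
        pool = cf.ProcessPoolExecutor(max_workers=8)
        futures = {pool.submit(process_dataset, i): i for i in range(20)}

        done, not_done = cf.wait(futures, timeout=DEADLINE_S)
        for f in done:
            print(f"dataset {futures[f]} -> {f.result()}")
        for f in not_done:
            print(f"dataset {futures[f]} missed the deadline")

        # Python 3.9+: stop without waiting for stragglers, drop queued tasks.
        pool.shutdown(wait=False, cancel_futures=True)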
I have a Java app running on Linux Mint. EVERY minute, the program shows a very noticeable slowdown -- a pause. The pause is a consistent 3 to 4 seconds. When we run further instances of the same program, they also pause 3 to 4 seconds each minute. Each program stops on a different second of the minute.
latest update:
After the last update (below), increasing the thread pool's thread count made the GUI problem go away. After running for around ~40 hours we observed a thread leak in the Jetty HttpClient blocking GET (Request.send()) call. To explain the mechanics: a main thread runs every few minutes, and uses an Executor to run an independent thread that calls the host with an HTTP GET command via Jetty's HttpClient Request.send().
After about 40 hours of operation, there was a spike in the number of threads running in the HttpClient pool. So for 40 hours, the same threads ran fine. The working hypothesis is that around that time, one or more send() calls did not complete or time out, and have not returned to the calling thread. Essentially, these threads are hung inside the Jetty client.
When watching each regular cycle in jVisualVM we see the normal behaviour each cycle: some HttpClient threads fire up for the host GET, execute, and go away within a few seconds. Also visible on the monitor are about 10 threads belonging to the Jetty HttpClient thread pool that have been 'present' for (now) 10 hours.
The expectation is that there was some error in the underlying client or network processing. I am surprised there was no time-out exception or programming exception. There are some clear questions I can ask now.
What can happen inside HttpClient that could just hang a Request.send()?
What is the time-out on the call's return? I would think there would still be absolute time-outs or checks for locking, etc. (no?)
Can the I/O system hang and leave the caller thread hanging, while Java obediently:
fires the manager thread at the scheduled time, then
the next HttpClient Request.send() happens, and
new thread(s) from the pool run up for the next send (as appears to have happened),
while the earlier send() is stuck in limbo?
Can I limit, or otherwise put a clean-up on, these stuck threads?
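On the last question: one general pattern (independent of Jetty) is to never let the scheduled caller block directly on the network call, but to run it on a worker and put a hard deadline on how long the caller waits. A sketch in Python purely to illustrate the shape (the real code is Java; the names and timings below are invented):

    # Run a possibly slow or stuck call on a worker and cap how long we wait.
    import concurrent.futures as cf
    import time

    def blocking_get(url):
        time.sleep(30)        # stand-in for a send() that is stuck or very slow
        return "response"

    if __name__ == "__main__":
        pool = cf.ThreadPoolExecutor(max_workers=4)
        future = pool.submit(blocking_get, "http://example.invalid/")
        try:
            # Hard cap on how long the caller waits, regardless of the worker.
            print(future.result(timeout=5))
        except cf.TimeoutError:
            print("GET did not finish in 5 s; abandoning this attempt")
        # The abandoned worker still runs to completion in the background here;
        # a truly hung call would also need a timeout at the HTTP-client level.
        pool.shutdown(wait=False)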
This was happening before we increased the thread pool size. What has happened is that the 'blame' has become more focused on the problem area. Also, we are suspicious of the underlying system, because we had lock-ups with Apache HttpClient too, again around the same (non-specific) time of day.
(prior update) ...
The pause behaviour observed is that the JavaFX GUI does not update/refresh; the display clock's (textView) setText() call was logged during the freeze at two updates per second (that's new information). The clock doesn't update on Mint Linux; it continues to update when running on Windows. To forestall repeating myself on questions about GC, logs, probes, etc.: the answer will be the same; we have run extensive diagnostics over weeks now. The issue is unmistakably a mix of: Linux JVM / Linux Mint / threads (per JavaFX). The other piece of new data is that increasing the thread-pool count by 2 appears to remove the freeze -- further testing is needed to confirm that and to tune the numbers. The question, though, is: what are the parameters that make the difference between the two platforms?
We have run several instances of the program on Windows for days with no pauses. When we run on a Mint Linux platform we see the freeze, and it is very consistent.
The program has several threads running on a schedule. One thread opens an HTTP socket to the internet. When we comment out that area, the pause vanishes. However, we don't see that behaviour on Windows. Experiments point to something specific to the Mint networking I/O subsystem, Linux scheduling, the Linux Java 8 JVM, or some interaction between them.
As you may guess, we are tearing our hair out on this one. For example, we turned off logging and the pause remained. We resumed logging and made just one call to the HTTP server; the pause still occurred every 60 seconds, on the same second count. This happens even when we do no other processing. We have tried different HTTP libraries, etc. It seems very clear it is in the JVM or in Linux.
Does anyone know of a way to resolve this?