Finding latency issues (stalls) in embedded Linux systems - linux

I have an embedded Linux system running on an Atmel AT91SAM9260EK board on which I have two processes running at real-time priority. A manager process periodically "pings" a worker process using POSIX message queues to check the health of the worker process. Usually the round-trip ping takes about 1ms, but very occasionally it takes much longer - about 800ms. There are no other processes that run at a higher priority.
It appears the stall may be related to logging (syslog). If I stop logging the problem seems to go away. However it makes no difference if the log file is on JFFS2 or NFS. No other processes are writing to the "disk" - just syslog.
What tools are available to me to help me track down why these stalls are occurring? I am aware of latencytop and will be using that. Are there some other tools that may be more useful?
Some details:
Kernel version: 2.6.32.8
libc (syslog functions): uClibc 0.9.30.1
syslog: busybox 1.15.2
No swap space configured [added in edit]
root filesystem is on tmpfs (loaded from initramfs) [added in edit]

The problem is (as you said) syslogd. While your process is running at a RT priority, syslogd is not. Additionally, syslogd does not lock its heap and can (and will) be paged out by the kernel, especially with very few 'customers'.
What you could try is:
Start another thread to manage a priority queue, have that thread talk to syslog. Logging would then just be acquiring a lock and inserting something into a list. Given only two subscribers, you should not spend a lot of time acquiring the mutex.
Not using syslog, implement your own logging (basically the first suggestion, minus talking to syslog).
I had a similar problem and my first attempt to fix it was to modify syslogd itself to lock its heap. That was a disaster. I then tried rsyslogd, which improved some but I still got random latency peaks. I ended up just implementing my own logging using a priority queue to help ensure that more critical messages were actually written first.
Note, if you are not using swap (at all), the shortest path to fixing this is probably implementing your own logging.

Related

Unable to locate the memory hog on openvz container

i have a very odd issue on one of my openvz containers. The memory usage reported by top,htop,free and openvz tools seems to be ~4GB out of allocated 10GB.
when i list the processes by memory usage or use ps_mem.py script, i only get ~800MB of memory usage. Similarily, when i browse the process list in htop, i find myself unable to pinpoint the memory hogging offender.
There is definitely a process leaking ram in my container, but even when it hits critical levels and i stop everything in that container (except for ssh, init and shells) i cannot reclaim the ram back. Only restarting the container helps, otherwise the OOM starts kicking in in the container eventually.
I was under the assumption that leaky process releases all its ram when killed, and you can observe its misbehavior via top or similar tools.
If anyone has ever experienced behavior like this, i would be grateful for any hints. The container is running icinga2 (which i suspect for leaking ram) , although at most times the monitoring process sits idle, as it manages to execute all its scheduled checks in more than timely manner - so i'd expect the ram usage to drop at those times. It doesn't though.
I had a similar issue in the past and in the end it was solved by the hosting company where I had my openvz container. I think the best approach would be to open a support ticket to your hoster, explain them the problem and ask them to investigate. Maybe they use some outdated kernel version or they did changes on the server that have impact on your ovz container.

Is it possible to avoid or minimize the use of scheduling policies in operating system design?

I recently stumbled upon the question above but I am not sure if I understand what it is asking.
How would one avoid the use of scheduling policies?
I would think that there isn't any other way...
Scheduling policy has nothing to do with the resource allocation! Processes are scheduled basically, and hence allocated resources as such.
From "Resource allocation (computer)" description on Wikipedia :-
When the user opens any program this will be counted as a process, and
therefore requires the computer to allocate certain resources for it
to be able to run. Such resources could have access to a section of
the computer's memory, data in a device interface buffer, one or more
files, or the required amount of processing power.
I don't know how you got confused between them. All the process would, at a time or another, get scheduled at any point of time; unless the CPU is an unfair one.
EDIT :
How would one avoid the use of scheduling policies?
If there are more than one user-process to be executed, then one has to apply the scheduling policy so that the processes get executed in some order. There has to be a queue to hold all the processes. See a different case in BareMetal OS below.
Then, there is BareMetal OS which is single address space OS.
Multitasking on BareMetal is unusual for operating systems in this day
and age. BareMetal uses an internal work queue that all CPU cores
poll. A task added to the work queue will be processed by any
available CPU core in the system and will execute until completion,
which results in no context switch overhead.
So, BareMetal OS doesn't use any scheduling policy, it is based on polling of the work-queue by the cores.

Shorting the time between process crash and shooting server in the head?

I have a routine that crashes linux and force a reboot using a system function.
Now I have the problem that I need to crash linux when a certain process dies. Using a script starting the process and if the script ends restart the server is not appropriate since it takes some ms.
Another idea is spawning the shooting processes alongside and use polling of a counter and if the counter is not incremented reboot the server would be another idea.
This would result in an almost instant reaction.
Now the question is what would be a good timeframe. I have no idea how the scheduler of linux would guarantee a certain update of any such counter and what a good timeout would be.
Also I would like to hear some alternatives to this second process spawning. Is there a possibility to advice linux to run a certain routine in case of a crash of the given process or a listener meachanism for the even of problems with a given process?
The timeout idea is already implemented in the kernel. You can register any application as a software watchdog, but you'll have to lower the default timeout. Have a look at http://linux.die.net/man/8/watchdog for some ideas. That application can also handle user-defined tests. Realistically unless you're trying to run kernel like linux-rt, having timeouts lower than 100ms can be dangerous on a system with heavy load - especially if the check needs to poll another application.
In cases of application crashes, you can handle them if your init supports notifications. For example both upstart and systemd can do that by monitoring files (make sure coredumps are created in the right place).
But in any case, I'd suggest rethinking the idea of millisecond-resolution restarts. Do you really need to kill the system in that time, or do you just need to isolate it? Just syncing the disks will take a few extra milliseconds and you probably don't want to miss that step. Instead of just killing the host, you could make sure the affected app isn't working (SIGABRT?) and kill all networking (flush iptables, change default to DROP).

Debugging utilities for Linux process hang issues?

I have a daemon process which does the configuration management. all the other processes should interact with this daemon for their functioning. But when I execute a large action, after few hours the daemon process is unresponsive for 2 to 3 hours. And After 2- 3 hours it is working normally.
Debugging utilities for Linux process hang issues?
How to get at what point the linux process hangs?
strace can show the last system calls and their result
lsof can show open files
the system log can be very effective when log messages are written to track progress. Allows to box the problem in smaller areas. Also correlate log messages to other messages from other systems, this often turns up interesting results
wireshark if the apps use sockets to make the wire chatter visible.
ps ax + top can show if your app is in a busy loop, i.e. running all the time, sleeping or blocked in IO, consuming CPU, using memory.
Each of these may give a little bit of information which together build up a picture of the issue.
When using gdb, it might be useful to trigger a core dump when the app is blocked. Then you have a static snapshot which you can analyze using post mortem debugging at your leisure. You can have these triggered by a script. The you quickly build up a set of snapshots which can be used to test your theories.
One option is to use gdb and use the attach command in order to attach to a running process. You will need to load a file containing the symbols of the executable in question (using the file command)
There are a number of different ways to do:
Listening on a UNIX domain socket, to handle status requests. An external application can then inquire as to whether the application is still ok. If it gets no response within some timeout period, then it can be assumed that the application being queried has deadlocked or is dead.
Periodically touching a file with a preselected path. An external application can look a the timestamp for the file, and if it is stale, then it can assume that the appliation is dead or deadlocked.
You can use the alarm syscall repeatedly, having the signal terminate the process (use sigaction accordingly). As long as you keep calling alarm (i.e. as long as your program is running) it will keep running. Once you don't, the signal will fire.
You can seamlessly restart your process as it dies with fork and waitpid as described in this answer. It does not cost any significant resources, since the OS will share the memory pages.

Are message queues obsolete in linux?

I've been playing with message queues (System V, but POSIX should be ok too) in Linux recently and they seem perfect for my application, but after reading The Art of Unix Programming I'm not sure if they are really a good choice.
http://www.faqs.org/docs/artu/ch07s02.html#id2922148
The upper, message-passing layer of System V IPC has largely fallen out of use. The lower layer, which consists of shared memory and semaphores, still has significant applications under circumstances in which one needs to do mutual-exclusion locking and some global data sharing among processes running on the same machine. These System V shared memory facilities evolved into the POSIX shared-memory API, supported under Linux, the BSDs, MacOS X and Windows, but not classic MacOS.
http://www.faqs.org/docs/artu/ch07s03.html#id2923376
The System V IPC facilities are present in Linux and other modern Unixes. However, as they are a legacy feature, they are not exercised very often. The Linux version is still known to have bugs as of mid-2003. Nobody seems to care enough to fix them.
Are the System V message queues still buggy in more recent Linux versions? I'm not sure if the author means that POSIX message queues should be ok?
It seems that sockets are the preferred IPC for almost anything(?), but I cannot see how it would be very simple to implement message queues with sockets or something else. Or am I thinking too complexly?
I don't know if it's relevant that I'm working with embedded Linux?
Personally I am quite fond of message queues and think they are arguably the most under-utilized IPC in the unix world. They are fast and easy to use.
A couple of thoughts:
Some of this is just fashion. Old things become new again. Add a shiny do-dad on message queues and they may be next year's newest and hottest thing. Look at Google's Chrome using separate processes instead of threads for its tabs. Suddenly people are thrilled that when one tab locks up it doesn't bring down the entire browser.
Shared memory has something of a He-man halo about it. You're not a "real" programmer if you aren't squeezing that last cycle out of the machine and MQs are marginally less efficient. For many, if not most apps, it is utter nonsense but sometimes it is hard to break a mindset once it takes hold.
MQs really aren't appropriate for applications with unbounded data. Stream oriented mechanisms like pipes or sockets are just easier to use for that.
The System V variants really have fallen out of favor. As a general rule go with POSIX versions of IPC when you can.
Yes, I think that message queues are appropriate for some applications. POSIX message queues provide a nicer interface, in particular, you get to give your queues names rather than IDs, which is very useful for fault diagnosis (makes it easier to see which is which).
Linux allows you to mount the posix message queues as a filesystem and see them with "ls", delete them with "rm" which is quite handy too (System V depends on the clunky "ipcs" and "ipcrm" commands)
I haven't actually used POSIX message queues because I always want to leave open the option to distribute my messages across a network. With that in mind, you might look at a more robust message-passing interface like zeromq or something that implements AMQP.
One of the nice things about 0mq is that when used from the same process space in a multithreaded app, it uses a lockless zero-copy mechanism that is quite fast. Still, you can use the same interface to pass messages over a network as well.
Biggest disadvantages of POSIX message queue:
POSIX message queue does not make it a requirement to be compatible with select().(It works with select() in Linux but not in Qnx system)
It has surprises.
Unix Datagram socket does the same task of POSIX message queue. And Unix Datagram socket works in socket layer. It is possible to use it with select()/poll() or other IO-wait methods. Using select()/poll() has the advantage when designing event-based system. It is possible to avoid busy loop in that way.
There is surprise in message queue. Think about mq_notify(). It is used to get receive-event. It sounds like we can notify something about the message queue. But it is actually registering for notification instead of notifying anything.
More surprise about mq_notify() is that it has to be called after every mq_receive(), which may cause a race-condition(when some other process/thread call mq_send() between the call of mq_receive() and mq_notify()).
And it has a whole set of mq_open, mq_send(), mq_receive() and mq_close() with their own definition which is redundant and in some case inconsistent with socket open(),send(),recv() and close() method specification.
I do not think message queue should be used for synchronization. eventfd and signalfd are suitable for that.
But it(POSIX message queue) has some realtime support. It has priority features.
Messages are placed on the queue in decreasing order of priority, with newer messages of the same priority being placed after older messages with the same priority.
But this priority is also available for socket as out-of-band data !
Finally, to me , POSIX message queue is a legacy API. I always prefer Unix Datagram socket instead of POSIX message queue as long as the real-time features are not needed.
Message queues are very useful to build local decoupled applications. They are super fast, they are block organized (no need for buffering, cutting, etc which is the case for streaming sockets), basically few memcpy() operations (user code copy block to kernel, and kernel copy block to other process reading from q), and that's the story for message delivery. Some industry known middlewares such as Oracle Tuxedo or Mavimax Enduro/X uses these queues to help to build load balanced, high performance, fault tolerant decomposed, distributed applications. These queues allows to do load balancing, when several executables reads from the same queue, and kernel scheduler just distributes the message to processes which ever is idling. The nice thing for Linux is that poll can be done on Posix queues, which helps a to solve certain scenarios. For IBM AIX it is possible to do poll on System V queues.
For example, two processes can communicate easily locally over the queues with quite impressive throughput (~70k req+rply/sec):
If networking is needed, then for example Enduro/X provides tpbridge process which basically reads from messages from local queue, sends blocks to some other machine, where the other end injects the messages back in the local queue.
Also when comparing to sockets, you do not get any issues with queues, such as busy/lingering sockets when for example some binary have crashed, i.e. program at startup can immediately start to read the queues and do the processing.

Resources