Best strategy to check hard disk issues on Red Hat Linux - linux

What is the best strategy (and utility) to check hard disk related issues on Red Hat Linux?

Depends on the issue. If you suspect a hardware failure, check dmesg for error messages and smartctl for details of what might have gone wrong. If it's a performance problem, check smartctl for remapped blocks and run sar -d 1 0 to see how hard the disk is actually being thrashed.
Beyond that, what issues are you actually having? More detailed questions elicit more detailed answers.
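If sar isn't installed, a similar "how busy is the disk" figure can be approximated from /proc/diskstats. The following is only a rough sketch in Python: it assumes the usual 14-column layout of that file (major, minor, name, then 11 counters), and the function names are made up for illustration. It takes two snapshots of the file as text and reports the share of the interval each device spent doing I/O.

```python
def parse_diskstats(text):
    """Map device name -> milliseconds spent doing I/O (13th column)."""
    busy_ms = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # blank or unexpectedly short line
        busy_ms[fields[2]] = int(fields[12])
    return busy_ms


def utilization(before, after, interval_s):
    """Percent of the interval each device was busy, from two snapshots."""
    result = {}
    for name, ms_after in after.items():
        if name in before:
            delta_ms = ms_after - before[name]
            result[name] = 100.0 * delta_ms / (interval_s * 1000.0)
    return result


def read_diskstats():
    """Read the live counters (Linux only)."""
    with open("/proc/diskstats") as f:
        return parse_diskstats(f.read())
```

To use it, call read_diskstats(), sleep for a known number of seconds, call it again, and pass both results to utilization(). A device that sits near 100% for long stretches is saturated.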

Related

Disk inexplicably filled

I have two Linux machines which should be near enough identical clones of each other. One of them shows 89% usage of /dev/sda1, and the other shows 27%.
I've tried the rather manual process of running du -h on the root file system and comparing the two, but there are no substantial discernible differences. Is there any other way to find out where the missing 20GB are?
Thanks!
Problem solved, there was an issue with an unmounted drive which caused it :)
ncdu will display the size of each directory from an ncurses interface. Probably what you're looking for.
Try looking at the information reported by the command
tune2fs -l /dev/sda1
Do you see any difference in the block size or anything else?
You can also try baobab (Disk Usage Analyzer), a GUI tool which displays disk usage in a clever visualization of nested pie charts.
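If you'd rather compare the two machines mechanically than eyeball du output, a small script can total the allocated space per top-level directory so the results can be diffed. This is a minimal sketch in Python, not a replacement for du: the function names are invented for illustration, it stays on one filesystem (much like du -x), and it counts hard-linked files once per link.

```python
import os


def allocated_bytes(path):
    """Bytes actually allocated to one entry (0 if it cannot be read)."""
    try:
        return os.lstat(path).st_blocks * 512  # st_blocks is in 512-byte units
    except OSError:
        return 0


def same_device(path, device):
    """True if path lives on the filesystem identified by device."""
    try:
        return os.lstat(path).st_dev == device
    except OSError:
        return False


def usage_by_top_level(root):
    """Return {entry name: allocated bytes} for each entry directly under root."""
    device = os.lstat(root).st_dev
    totals = {}
    for entry in os.listdir(root):
        top = os.path.join(root, entry)
        if not same_device(top, device):
            continue  # another filesystem is mounted here
        total = allocated_bytes(top)
        if os.path.isdir(top) and not os.path.islink(top):
            for dirpath, dirnames, filenames in os.walk(top):
                # Prune subdirectories that live on another filesystem.
                dirnames[:] = [d for d in dirnames
                               if same_device(os.path.join(dirpath, d), device)]
                for name in dirnames + filenames:
                    total += allocated_bytes(os.path.join(dirpath, name))
        totals[entry] = total
    return totals
```

Run usage_by_top_level("/") as root on both machines and compare the two dictionaries. Keep in mind that files written beneath a mount point are hidden while something is mounted over them, so a walk like this cannot see them until the drive is unmounted.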

skb allocation failures in 2.6.32

We are running CentOS 6.3 (based on 2.6.32), and under high load we get order 0 allocation failures when allocating skbs. This problem is not observed on CentOS 5.4 (based on 2.6.18).
This problem is very similar to https://bugzilla.kernel.org/show_bug.cgi?id=14141.
Bug 14141 was closed by a patch (related to tty) that I'm not sure is related to this problem.
There is a very long discussion on lkml about the bug (see https://lkml.org/lkml/2009/10/2/86), which ends with some possible leads regarding congestion_wait() and more, but without a patch.
I saw some proposals to increase vm.min_free_kbytes. IMHO that's not a solution, but rather a workaround.
I suppose the problem has been solved, but I can't find when and how.
Any ideas?
Thanks

When the linux kernel reports time as spent servicing "soft interrupts", precisely what does it mean?

I have top showing high activity in the %si column for several cpus. Sar reports something similar.
Q1) What are the possibilities for what might be executing in the kernel? It appears that "softirqs" themselves are more or less outdated, and generally function as a mechanism to implement other interfaces, including tasklets, rculists, and I'm not sure what else. I'd like to get a comprehensive list.
Q2) How can I get more precise information on what's actually running as "soft interrupts" on my test system?
As it happens, I have strong suspicions that a particular device driver is involved, since the hard interrupt % is also high, and only on whatever cpu that's presently handling this particular device's interrupt :-) But I haven't so far found anything in the driver source code that looks to me as if it could result in softirq activity. I'm probably missing the obvious, so I'm asking for help ;-)
My kernel is rather outdated - 2.6.32 based (RHEL 6.1, I believe) but I doubt that matters too much for questions this general.
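One possible starting point for Q2, assuming the kernel exposes /proc/softirqs (as far as I know it has been present since around 2.6.31), is to snapshot that file twice and see which softirq type is increasing on the busy CPU. The sketch below is in Python, the function names are made up for illustration, and it only identifies the softirq class (NET_RX, TASKLET, and so on), not the driver behind it.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs text into {softirq name: [count per CPU]}."""
    counts = {}
    for line in text.splitlines()[1:]:  # the first line is the CPU header
        name, _, rest = line.partition(":")
        if rest.strip():
            counts[name.strip()] = [int(x) for x in rest.split()]
    return counts


def softirq_deltas(before, after):
    """Per-CPU increase of each softirq counter between two snapshots."""
    return {
        name: [a - b for a, b in zip(after[name], before[name])]
        for name in after
        if name in before
    }


def read_softirqs():
    """Read the live counters (Linux only)."""
    with open("/proc/softirqs") as f:
        return parse_softirqs(f.read())
```

Take one reading, wait a few seconds under load, take another, and pass both to softirq_deltas(). The row that grows fastest on the CPU handling the device's interrupt is the one to chase in the driver source.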

Disk failure detection perl script

I need to write a script that checks the disk every minute and reports if it is failing for any reason. The error could be an outright disk failure, a bad sector, and so on.
First, I wonder whether a script already exists that does this, since it seems like a standard procedure (I really don't want to reinvent the wheel).
Second, if I want to look for errors in /var/log/messages, is there a list of standard error strings for disks that I can use?
I have searched the net for this a lot; there is plenty of information and at the same time nothing that answers the question.
Any help will be much appreciated.
Thanks,
You could simply parse the output of dmesg, which usually reports fairly detailed information about drive errors; that's how I've collected stats on failing drives before.
You might get better-documented information by using Parse::Syslog, or by reading lower-level kernel reporting directly, though.
Logwatch does the /var/log/messages part of the ordeal (as well as any other logfiles that you choose to add). You can either use it as is, or use its code to roll your own solution (it's all written in Perl).
If your hard drives support SMART, I suggest you use smartctl output for diagnostics, as it includes a lot of useful info that can be monitored over time to detect failure.
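As a rough illustration of the smartctl approach, here is a sketch in Python rather than Perl (the parsing translates directly). It assumes the usual ten-column attribute table printed by smartctl -A, and the list of watched attributes is only an illustrative choice, not an authoritative set of failure indicators. The function names are made up for this example.

```python
import subprocess

# Attributes often watched as early-failure signals (an illustrative choice).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Map attribute name -> raw value from `smartctl -A` output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw value is not a plain integer; ignored here
    return attrs


def failing_attributes(attrs, watched=WATCHED):
    """Return the watched attributes whose raw value is non-zero."""
    return {name: attrs[name] for name in watched if attrs.get(name, 0) > 0}


def check_disk(device):
    """Run smartctl on a device (needs root and smartmontools installed)."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    return failing_attributes(parse_smart_attributes(out))
```

A cron entry could call check_disk("/dev/sda") once a minute and send a report whenever the returned dictionary is not empty. Logging the raw values over time also shows trends, which is usually more telling than a single reading.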

Using "top" in Linux as semi-permanent instrumentation

I'm trying to find the best way to use 'top' as semi-permanent instrumentation in the development of a box running embedded Linux. (The instrumentation will be removed from the final-test and production releases.)
My first pass is to simply add this to init.d:
top -b -d 15 >/tmp/toploop.out &
This runs top in "batch" mode every 15 seconds. Let's assume that /tmp has plenty of space…
Questions:
Is 15 seconds a good value to choose for general-purpose monitoring?
Other than disk space, how seriously is this perturbing the state of the system?
What other (perhaps better) tools could be used like this?
Look at collectd. It's a very lightweight system monitoring framework coded for performance.
We use sysstat to monitor things like this.
You might find that vmstat and iostat with a delay and no repeat counter are a better option.
I suspect 15 seconds would be more than adequate unless you actually want to watch what's happening in real time, but that doesn't appear to be the case here.
As far as load goes, on an idling PIII 900 MHz with 768MB of RAM running Ubuntu (not sure which version, but not more than a year old) I have top updating every 0.5 seconds and it's at about 2% CPU utilization. At 15s updates, I'm seeing 0.1% CPU utilization.
Depending upon what exactly you want, you could use the output of uptime, free, and ps to get most, if not all, of top's information.
If you are looking for overall load, uptime is probably sufficient. However, if you want specific information about processes, you are adventurous, and you have the /proc filesystem enabled, you may want to write your own tools. The primary benefit in this environment is that you can focus on exactly what you want and minimize the load introduced to the system.
The proc file system gives your application read access to the kernel data that keeps track of many of the interesting variables. Reading from /proc is one of the lightest ways to get this information. Additionally, you may be able to get more information than top provides. I've done this in the past to get the amount of time a process spent in user and system mode. You can also use it to get the number of file descriptors a process has open, or detailed information about how the network system is working.
Much of this information is pre-processed by other applications, which you can use if they give you what you need. However, it is rather straightforward to read the raw information. Do a man proc for more information.
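To give a feel for how little code reading /proc takes, here is a minimal sketch in Python (on a small embedded box the same logic is easy to port to C or shell). It reads user and system CPU time from /proc/<pid>/stat, where utime and stime are fields 14 and 15 per man proc, and counts the entries in /proc/<pid>/fd. The function names are invented for this example.

```python
import os


def cpu_times_from_stat(stat_text, ticks_per_second):
    """Return (user seconds, system seconds) from the text of /proc/<pid>/stat."""
    # The command name (field 2) may contain spaces or parentheses,
    # so split on the last ')' and count fields from there.
    rest = stat_text.rsplit(")", 1)[1].split()
    utime_ticks = int(rest[11])  # field 14: time in user mode
    stime_ticks = int(rest[12])  # field 15: time in kernel mode
    return utime_ticks / ticks_per_second, stime_ticks / ticks_per_second


def process_snapshot(pid):
    """Collect CPU times and open file descriptor count for one process."""
    with open(f"/proc/{pid}/stat") as f:
        user_s, system_s = cpu_times_from_stat(f.read(),
                                               os.sysconf("SC_CLK_TCK"))
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    return {"user_s": user_s, "system_s": system_s, "open_fds": open_fds}
```

Calling process_snapshot(os.getpid()) in a loop with a sleep gives a per-process log with far less overhead than running top in batch mode. Reading another user's /proc/<pid>/fd generally needs root.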
Pity you haven't said what you are monitoring for.
You should decide whether 15 seconds is OK or not. Feel free to drop it much lower if you wish (and have a fast HDD).
No worries, unless you are running a soft real-time system.
Have a look at the tools suggested in other answers. I'll add another suggestion: iotop, for answering "who is thrashing the HDD" questions.
At work for system monitoring during stress tests we use a tool called nmon.
What I love about nmon is it has the ability to export to XLS and generate beautiful graphs for you.
It generates statistics for:
Memory Usage
CPU Usage
Network Usage
Disk I/O
Good luck :)
