How to execute a command when df -h returns 98% full
I have a disk that df -h reports as follows:
/dev/sdb1 917G 816G 55G 94% /disk1
If it reports 98% full, I would like to run the following:
find . -size +80M -delete
How do I do it? I will run the shell script using cron:
* * * * * sh /root/

Execute df -h, pipe the command output to grep matching "/dev/sdb1", and process that line by awk, checking to see if the numeric portion of column 5 ($5 in awk terms) is larger than or equal to 98. Don't forget to check for the possibility that it's over 98.
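The pipeline described above could be sketched as follows. This is only a sketch under stated assumptions: usage_of and is_full are hypothetical helper names, df -P is used instead of df -h because the POSIX output format keeps each filesystem on one line, and /disk1 as the cleanup directory is taken from the df line in the question.

```shell
#!/bin/sh
# usage_of DEVICE: read `df -P` output on stdin and print the numeric
# capacity (column 5 without the percent sign) for DEVICE.
usage_of() {
    awk -v dev="$1" '$1 == dev { sub(/%/, "", $5); print $5 }'
}

# is_full PERCENT [THRESHOLD]: succeed when PERCENT >= THRESHOLD (default 98),
# so 99% and 100% trigger the cleanup as well.
is_full() {
    [ "${1:-0}" -ge "${2:-98}" ]
}

# Intended use from the cron script (left commented out because it deletes files):
#   pct=$(df -P | usage_of /dev/sdb1)
#   if is_full "$pct"; then find /disk1 -size +80M -delete; fi
```

Matching the device name exactly in awk avoids a separate grep, and an empty result (device not found) is treated as 0% so nothing is deleted by accident.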

You need to schedule your script, check the disk utilization, and if the utilization is at or above 98%, delete files.
For scheduling your script you can reference the Wikipedia Cron entry.
There is an example of using the find command to delete files on the Unix & Linux site:
"How to delete directories based on find output?"
For your test, you'll need test constructs and command substitution. Note that backticks work in any sh, but the $(...) form, which is also specified by POSIX, has superseded backticks for command substitution in bash.
To get your disk utilization you could use:
df | grep -F "/dev/sdb1" | awk '{print $5}' | tr -d %
That's a fixed-string grep (-F) to get your specific disk, awk to pull out the 5th column, and tr with the delete flag to get rid of the percent sign.
And your test might look something like this:
if [ `df | grep -F "/dev/sdb1" | awk '{print $5}' | tr -d %` -ge 98 ];
then echo "Insert your specific cleanup command here.";
fi
There are many ways to tackle the issue of course, but hope that helps!


How to grep text patterns from remote crontabs using xargs through SSH?

I'm developing a script to search for patterns within scripts executed from cron on a bunch of remote servers through SSH.
Script on client machine -- SSH --> Remote Servers CRON/Scripts
For now I can't get the correct output.
Script on client machine
server_list=( '172.x.x.x' '172.x.x.y' '172.x.x.z' )
for s in ${server_list[@]}; do
ssh -i /home/user/.ssh/my_key.rsa user@${s} crontab -l | grep -v '^#\|^[[:space:]]*$' | cut -d ' ' -f 6- | awk '{print $1}' | grep -v '^$\|^echo\|^find\|^PATH\|^/usr/bin\|^/bin/' | xargs -0 grep -in 'server.tld\|10.x.x.x'
done
This only gives me the paths of the scripts from the crontab, not the matched lines and line numbers. In addition, the first line is prefixed with the "grep:" keyword (example below):
grep: /opt/directory/
How to get proper output, meaning the script path plus line number plus line of matching pattern?
Remote CRON examples
00 6 * * * /opt/directory/ foo
30 6 * * * /opt/directory/ bar
Remote script content examples
1 ) This will match grep pattern
ping -c 4 server.tld && echo "server.tld ($1)"
2 ) This won't match grep pattern
ping -c 4 8.x.x.x && echo "8.x.x.x ($1)"
Without example input, it's really hard to see what your script is attempting to do. But the cron parsing could almost certainly be simplified tremendously by refactoring all of it into a single Awk script. Here is a quick stab, with obviously no way to test.
# No longer using an array for no good reason, so /bin/sh will work
for s in 172.x.x.x 172.x.x.y 172.x.x.z; do
ssh -i /home/user/.ssh/my_key.rsa "user@${s}" crontab -l |
awk '! /^#|^[[:space:]]*$/ && $6 !~ /^$|^(echo|find|PATH|\/usr\/bin|\/bin\/)/ { print $6 }' |
# no -0; use grep -E and properly quote literal dot
xargs grep -Ein 'server\.tld|10.x.x.x'
done
Your command would not output null-delimited data to xargs so probably the immediate problem was that xargs -0 would receive all the file names as a single file name which obviously does not exist, and you forgot to include the ": file not found" from the end of the error message.
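The effect of -0 on newline-delimited input can be seen with a small experiment. This is a sketch, not part of the original script: count_args_newline and count_args_nul are hypothetical names, and -0 is assumed to be available (it is a GNU/BSD extension rather than POSIX).

```shell
#!/bin/sh
# Each function prints how many arguments xargs builds from its stdin.
count_args_newline() { xargs sh -c 'echo "$#"' sh; }      # splits on blanks and newlines
count_args_nul()     { xargs -0 sh -c 'echo "$#"' sh; }   # splits on NUL bytes only

# Example (two newline-separated names, no NUL bytes in the stream):
#   printf 'a.sh\nb.sh\n' | count_args_newline
#   printf 'a.sh\nb.sh\n' | count_args_nul
```

With -0, the whole stream including the embedded newlines becomes one argument, which is why grep complained about a single file name that does not exist.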
The use of grep -E is a minor hack to enable a more modern regex syntax which is more similar to that in Awk, where you don't have to backslash the "or" pipe etc.
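To see what the Awk filter in the loop keeps, it can be run on a sample crontab without any SSH involved. This is a sketch: cron_scripts is a hypothetical wrapper name, and /opt/directory/check.sh in the usage comment is a made-up path, since the real paths in the question are truncated.

```shell
#!/bin/sh
# cron_scripts: read `crontab -l` output on stdin and print the command
# word (field 6) of every entry that is not a comment, a blank line, a
# variable assignment, or a call to echo/find//usr/bin//bin.
cron_scripts() {
    awk '! /^#|^[[:space:]]*$/ && $6 !~ /^$|^(echo|find|PATH|\/usr\/bin|\/bin\/)/ { print $6 }'
}

# Example:
#   printf '%s\n' '# nightly jobs' 'PATH=/usr/bin:/bin' \
#       '00 6 * * * /opt/directory/check.sh foo' \
#       '30 6 * * * echo hello' | cron_scripts
```

Lines such as PATH=... have no sixth field, so the ^$ alternative is what drops them.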
This script, like your original, runs grep on the local system where you run the SSH script. If you want to run the commands on the remote server, you will need to refactor to put the entire pipeline in single quotes or a here document:
for s in 172.x.x.x 172.x.x.y 172.x.x.z; do
ssh -i /home/user/.ssh/my_key.rsa "user@${s}" <<\________HERE
crontab -l |
awk '! /^#|^[[:space:]]*$/ && $6 !~ /^$|^(echo|find|PATH|\/usr\/bin|\/bin\/)/ { print $6 }' |
xargs grep -Ein 'server\.tld|10.x.x.x'
________HERE
done
The refactored script contains enough complexities in the quoting that you probably don't want to pass it as an argument to ssh, which requires you to figure out how to quote strings both locally and remotely. It's easier then to pass it as standard input, which obviously just gets transmitted verbatim.
If you get "Pseudo-terminal will not be allocated because stdin is not a terminal.", try using ssh -t. Sometimes you need to add multiple -t options to completely get rid of this message.

Performance of wc -l

I ran the following command :
time for i in {1..100}; do find / -name "*.service" | wc -l; done
I got 100 lines of output, then:
real 0m35.466s
user 0m15.688s
sys 0m14.552s
I then ran the following command :
time for i in {1..100}; do find / -name "*.service" | awk 'END{print NR}'; done
I got 100 lines of output, then:
real 0m35.036s
user 0m15.848s
sys 0m14.056s
To be precise, I had already run find / -name "*.service" just before, so it was cached for both commands.
I expected wc -l to be faster. Why is it not?
others have mentioned that you're probably timing find, not wc or awk. still, there may be interesting differences to explore between wc and awk in their various flavors.
here are the results I get:
Mac OS 10.10.5 awk 0.16m lines/second
GNU awk/gawk 4.1.4 4.4m lines/second
Mac OS 10.10.5 wc 6.8m lines/second
GNU wc 8.27 11m lines/second
i didn't use find, but instead used wc -l or awk 'END{print NR}' on a large text file (66k lines) in a loop.
i varied the order of the commands and didn't find any deviations large enough to change the rankings i reported.
LC_CTYPE=C had no measurable effect on any of these.
don't use mac builtin command line tools except for trivial amounts of data.
GNU wc is faster than GNU awk at counting lines.
i use MacPorts GNU binaries. it would be interesting to see how Homebrew binaries compare. (i'm guessing they'd lose.)
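A comparison along those lines, without find in the pipeline, could be set up roughly like this. It is a sketch rather than the exact benchmark used above: make_sample, count_wc and count_awk are hypothetical names, and the 66000-line file only mirrors the size mentioned in the text.

```shell
#!/bin/sh
# make_sample FILE: write a throwaway 66000-line test file.
make_sample() {
    seq 1 66000 > "$1"
}

count_wc()  { wc -l < "$1" | tr -d ' '; }      # tr strips the padding some wc builds add
count_awk() { awk 'END { print NR }' "$1"; }

# To compare timings interactively (assumes a shell with a `time` keyword, e.g. bash):
#   f=$(mktemp); make_sample "$f"
#   time for i in $(seq 100); do count_wc  "$f" >/dev/null; done
#   time for i in $(seq 100); do count_awk "$f" >/dev/null; done
```

Both functions must report the same count; only the elapsed time should differ.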
Three things:
Such a small difference is usually not significant:
0m35.466s - 0m35.036s = 0m0.43s or 1.2%
Yet wc -l is much faster (roughly 8x in the timings below) than awk 'END{print NR}'.
% time seq 100000000 | awk 'END{print NR}' > /dev/null
real 0m13.624s
user 0m14.656s
sys 0m1.047s
% time seq 100000000 | wc -l > /dev/null
real 0m1.604s
user 0m2.413s
sys 0m0.623s
My guess is that the hard drive cache holds the find results, so after the first run with wc -l, most of the reads needed for find are in the cache. Presumably the difference in times between the initial find with disk reads and the second find with cache reads, would be greater than the difference in run times between awk and wc.
One way to test this is to reboot, which clears the hard disk cache, then run the two tests again, but in the reverse order, so that awk is run first. I'd expect that the first-run awk would be even slower than the first-run wc, and the second-run wc would be faster than the second-run awk.

How to trace the list of PIDs running on a specific core?

I'm trying to run a program on a dedicated core in Linux. (I know Jailhouse is a good way to do so, but I have to use off-the-shelf Linux. :-( )
Other processes, such as interrupt handlers, kernel threads, and service processes, may also run on the dedicated core occasionally. I want to disable as many such processes as possible. To do that, I first need to pin down the list of processes that may run on the dedicated core.
My question is:
Are there any existing tools that I can use to trace the list of PIDs or processes that run on a specific core over a time interval?
Thank you very much for your time and help in this question!
TL;DR Dirty hacky solution.
DISCLAIMER: At some point stops working "column: line too long" :-/
Copy this to:
touch lastPIDs
touch CPU_PIDs
# TARGET_CPU must be set to the CPU id to watch before starting the loop
while true; do
ps ax -o cpuid,pid | tail -n +2 | sort | xargs -n 2 | grep -E "^$TARGET_CPU " | awk '{print $2}' > lastPIDs
for i in {1..100}; do printf "#\n" >> lastPIDs; done
cp CPU_PIDs aux
paste lastPIDs aux > CPU_PIDs
column -t CPU_PIDs > CPU_PIDs.humanfriendly.tsv
sleep 1
done
chmod +x
Then open CPU_PIDs.humanfriendly.tsv with your favorite editor, and ¡inspect!
The key is in the "ps -o cpuid,pid" bit, for more detailed info, please comment. :D
Infinite loop with:
ps ax -o cpuid,pid | tail -n +2 | sort | xargs -n 2 | grep -E "^$TARGET_CPU " | awk '{print $2}' > lastPIDs
ps ax -o cpuid,pid
show the PIDs together with the CPU each one is associated with
tail -n +2
remove the header line
sort
sort by cpuid
xargs -n 2
remove the white space at the beginning
grep -E "^$TARGET_CPU "
filter by CPU id (the trailing space keeps CPU 1 from also matching 10, 11, ...)
awk '{print $2}'
get the PID column
> lastPIDs
write the latest PIDs for the target CPU id to a file
for i in {1..100}; do printf "#\n" >> lastPIDs; done
hack for pretty .tsv printing with the "column -t" command
cp CPU_PIDs aux
CPU_PIDs holds the whole timeline; we copy it to the aux file so that the next command can use it as both input and output
paste lastPIDs aux > CPU_PIDs
add the lastPIDs column to the whole timeline file CPU_PIDs
column -t CPU_PIDs > CPU_PIDs.humanfriendly.tsv
pretty-print the whole timeline file CPU_PIDs
stackoverflow answer to: ps utility in linux (procps), how to check which CPU is used
by Mikel
stackoverflow answer to: Echo newline in Bash prints literal \n
by sth
stackoverflow answer to: shell variable in a grep regex
by David W.
superuser answer to: Aligning columns in output from a UNIX command
by Janne Pikkarainen
nixCraft article: HowTo: Unix For Loop 1 to 100 Numbers
The best way to obtain what you want is to operate as follows:
Use the isolcpus= Linux kernel boot parameter to "free" one core from the Linux scheduler
Disable the irqbalance daemon (in case it is executing)
Set the IRQs affinities to the other cores by manually writing the CPU mask on /proc/irq/<irq_number>/smp_affinity
Finally, run your program setting the affinity to the dedicated core through the taskset command.
In this case, such core will only execute your program. For checking, you can type ps -eLF and look at the PSR column (which specifies the CPU number).
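Two of the steps above lend themselves to small helpers: computing the hexadecimal mask to write to smp_affinity, and filtering ps output by the PSR column. The following is a sketch under assumptions: cpu_mask_without and threads_on_cpu are hypothetical names, the mask helper assumes at most 32 CPUs (larger systems use comma-separated 32-bit groups), and PSR only shows the CPU a thread last ran on, so it is a snapshot rather than a trace.

```shell
#!/bin/sh
# cpu_mask_without NCPUS CPU: print a hex mask with bits 0..NCPUS-1 set
# except bit CPU, i.e. "every core but the dedicated one".
cpu_mask_without() {
    printf '%x\n' $(( ((1 << $1) - 1) & ~(1 << $2) ))
}

# threads_on_cpu CPU: filter `ps -eLo psr,pid,tid,comm` output (stdin) down
# to the threads whose PSR column equals CPU; prints "pid tid comm".
threads_on_cpu() {
    awk -v cpu="$1" 'NR > 1 && $1 == cpu { print $2, $3, $4 }'
}

# Intended use (the write needs root; IRQ 42 and ./my_program are made-up examples):
#   cpu_mask_without 4 3 > /proc/irq/42/smp_affinity
#   taskset -c 3 ./my_program &
#   ps -eLo psr,pid,tid,comm | threads_on_cpu 3
```

Running the ps filter periodically gives a rough picture of what still lands on the isolated core after the IRQ affinities have been moved.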
Not a direct answer to the question, but I usually use the perf context-switches software event to identify perturbation of my benchmarks by the system or by other processes.

Optimizing search in linux

I have a huge log file close to 3GB in size.
My task is to generate some reporting based on # of times something is being logged.
I need to find the number of times StringA, StringB and StringC are each being called.
What I am doing right now is:
grep "StringA" server.log | wc -l
grep "StringB" server.log | wc -l
grep "StringC" server.log | wc -l
This is a long process and my script takes close to 10 minutes to complete. What I want to know is whether this can be optimized. Is it possible to run one grep command and find out the number of times StringA, StringB and StringC have been called individually?
You can use grep -c instead of wc -l:
grep -c "StringA" server.log
grep can't report count of individual strings. You can use awk:
out=$(awk '/StringA/{a++} /StringB/{b++} /StringC/{c++} END{print a+0, b+0, c+0}' server.log)
Then you can extract each count with a simple bash array:
arr=($out)
echo "StringA=${arr[0]}"
echo "StringB=${arr[1]}"
echo "StringC=${arr[2]}"
This (grep -c without wc) is certainly going to be faster, and the awk solution, which reads the file only once, is possibly faster still. But I haven't measured either.
Certainly this approach could be optimized since grep doesn't perform any text indexing. I would use a text indexing engine like one of those from this review or this stackexchange QA . Also you may consider using journald from systemd which stores logs in a structured and indexed format so lookups are more effective.
So many greps so little time... :-)
According to David Lyness, a straight grep search is about 7 times as fast as an awk in large file searches.
If that is the case, the current approach could be optimized by changing grep to fgrep, but only if the patterns being searched for are not regular expressions. fgrep is optimized for fixed patterns.
If the number of instances is relatively small compared to the original log file entries, it may be an improvement to use the egrep version of grep to create a temporary file filled with all three instances:
egrep "StringA|StringB|StringC" server.log > tmp.log
grep "StringA" tmp.log | wc -l
grep "StringB" tmp.log | wc -l
grep "StringC" tmp.log | wc -l
The egrep variant of grep allows for a | (vertical bar/pipe) character to be used between two or more separate search strings so that you can find multiple strings in statement. You can use grep -E to do the same thing.
Full documentation is in the man grep page and information about the Extended Regular Expressions that egrep uses from the man 7 re_format command.
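The pre-filtering idea can be combined with fixed-string matching and grep -c. This is a sketch, not a measured optimization: count_matches is a hypothetical helper, it expects at least one search string, and like the commands above it counts matching lines rather than individual occurrences.

```shell
#!/bin/sh
# count_matches LOG STRING...: scan LOG once for any of the fixed strings,
# then count each string in the (hopefully much smaller) filtered copy.
count_matches() {
    log=$1; shift
    pats=$(mktemp) || return 1
    small=$(mktemp) || return 1
    printf '%s\n' "$@" > "$pats"
    # -F: fixed strings, -f: read one pattern per line from a file
    grep -F -f "$pats" "$log" > "$small" || :   # exit status 1 only means "no match"
    for s in "$@"; do
        printf '%s=%s\n' "$s" "$(grep -F -c -e "$s" "$small")"
    done
    rm -f "$pats" "$small"
}

# Example:
#   count_matches server.log StringA StringB StringC
```

Whether this beats the single-pass awk version depends on how small the filtered file is relative to the log, so it is worth timing both on the real data.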

Give the mount point of a path

The following, very non-robust shell code will give the mount point of $path:
(for i in $(df|cut -c 63-99); do case $path in $i*) echo $i;; esac; done) | tail -n 1
Is there a better way to do this in shell?
This script is really awful, but has the redeeming quality that it Works On My Systems. Note that several mount points may be prefixes of $path.
On a Linux system:
cas@txtproof:~$ path=/sys/block/hda1
cas@txtproof:~$ for i in $(df -a|cut -c 57-99); do case $path in $i*) echo $i;; esac; done| tail -1
On a Mac OSX system
cas local$ path=/dev/fd/0
cas local$ for i in $(df -a|cut -c 63-99); do case $path in $i*) echo $i;; esac; done| tail -1
Note the need to vary cut's parameters, because of the way df's output differs; using awk solves this, but even awk is non-portable, given the range of result formatting various implementations of df return.
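Because several mount points may be prefixes of $path, the prefix test itself can be separated from the df parsing. The following is a sketch: longest_mount_prefix is a hypothetical helper that expects one mount point per line on stdin (however that list is obtained) and matches on whole path components, so /usr is not treated as a prefix of /usrx.

```shell
#!/bin/sh
# longest_mount_prefix PATH: read mount points (one per line) on stdin and
# print the longest one that is PATH itself or a parent directory of it.
# Caveat: awk -v interprets backslash escapes in PATH.
longest_mount_prefix() {
    awk -v p="$1" '
        {
            m = $0
            # match when m equals the path, is the root, or is followed by "/"
            if (m == p || m == "/" || index(p, m "/") == 1) {
                if (length(m) > length(best)) best = m
            }
        }
        END { if (best != "") print best }'
}

# Example:
#   printf '%s\n' / /usr /usr/local | longest_mount_prefix /usr/local/bin
```

Choosing the longest match replaces the tail -n 1 trick, which relies on df listing mount points in a convenient order.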
It looks like munging tabular output is the only way within the shell, but
df -P "$path" | tail -1 | awk '{ print $NF}'
based on ghostdog74's answer, is a big improvement on what I had. Note two new issues: firstly, df $path insists that $path names an existing file, the script I had above doesn't care; secondly, there are no worries about dereferencing symlinks. This doesn't work if you have mount points with spaces in them, which occurs if one has removable media with spaces in their volume names.
It's not difficult to write Python code to do the job properly.
df takes the path as parameter, so something like this should be fairly robust;
df "$path" | tail -1 | awk '{ print $6 }'
In theory stat will tell you the device the file is on, and there should be some way of mapping the device to a mount point.
For example, on linux, this should work:
stat -c '%m' $path
Always been a fan of using formatting options of a program, as it can be more robust than manipulating output (eg if the mount point has spaces). GNU df allows the following:
df --output=target "$path" | tail -1
Unfortunately there is no option I can see to prevent the printing of a header, so the tail is still required.
i don't know what your desired output is, therefore this is a guess
df | awk -v path="$path" 'NR>1 && $NF~path{
 print $NF
}'
Using cut with -c is not really reliable, since the column widths in df's output vary: a 5% can change to 10%, say, and you will miss some characters. Since the mount point is always at the back, you can use fields and field delimiters. In the above, $NF is the last column, which is the mount point.
I would take the source code to df and find out what it does besides calling stat as Douglas Leeder suggests.
Line-by-line parsing of the df output will cause problems as those lines often look like
1234567 1000000 200000 90% /path/to/mountpoint
With the added complexity of parsing those kinds of lines as well, probably calling stat and finding the mountpoint is less complex.
If you want to use only df and awk to find the filesystem device/remote share or a mount point and they include spaces you can cheat by defining the field separator of awk to be a regular expression that matches the format of the numeric sizes used to display total size, used space, available space and capacity percentage. By defining those columns as the field separator you are then left with $1 representing the filesystem device/remote share and $NF representing the mount path.
Take this for example:
[root@testsystem ~] df -P
Filesystem 1024-blocks Used Available Capacity Mounted on
WITH SPACES 11695881728 11186577920 509303808 96% /mnt/MOUNT WITH SPACES
If you attempt to parse this with the quick and dirty awk '{print $1}' or awk '{print $NF}' you'll only get a portion of the filesystem/remote share path and mount path and that's no good. Now make awk use the four numeric data columns as the field separator.
[root@testsystem ~] df -P "/mnt/MOUNT WITH SPACES/path/to/file/filename.txt" | \
awk 'BEGIN {FS="[ ]*[0-9]+%?[ ]+"}; NR==2 {print $1}'
WITH SPACES
[root@testsystem ~] df -P "/mnt/MOUNT WITH SPACES/path/to/file/filename.txt" | \
awk 'BEGIN {FS="[ ]*[0-9]+%?[ ]+"}; NR==2 {print $NF}'
/mnt/MOUNT WITH SPACES
Enjoy :-)
Edit: These commands are based on RHEL/CentOS/Fedora but should work on just about any distribution.
Just had the same problem. If some mount point (or the mounted device) is sufficient, as in my case, you can do:
DEVNO=$(stat -c '%d' /srv/sftp/testconsumer)
MP=$(findmnt -n -f -o TARGET /dev/block/$((DEVNO/2**8)):$((DEVNO&2**8-1)))
(or split the hex DEVNO %D with /dev/block/$((0x${DEVNO:0:${#DEVNO}-2})):$((0x${DEVNO:2:2})))
Alternatively, the following loop comes to mind, since I cannot find a proper basic command for this:
TARGETPATHTMP=$(readlink -f /srv/sftp/testconsumer)
TARGETMOUNT=$(findmnt -d backward -f -n -o TARGET --target "$TARGETPATHTMP")
while [[ -z "$TARGETMOUNT" ]]; do
TARGETPATHTMP=$(dirname "$TARGETPATHTMP")
TARGETMOUNT=$(findmnt -d backward -f -n -o TARGET --target "$TARGETPATHTMP")
done
This should always work, but it is much more than I would expect for such a simple task.
(Edited to use readlink -f to allow for non-existing files; -m or -e for readlink could be used instead if more components might not exist or all components must exist.)
mount | grep "^$path" | awk '{print $3}'
I missed this when I looked over prior questions: Python: Get Mount Point on Windows or Linux, which says that os.path.ismount(path) tells if path is a mount point.
My preference is for a shell solution, but this looks pretty simple.
I use this:
df -h $path | cut -f 1 -d " " | tail -1
Linux has this, which will avoid problem with spaces:
lsblk -no MOUNTPOINT ${device}
Not sure about BSD land.
f () { echo $6; }; f $(df -P "$path" | tail -n 1)
