Get average words per file - Linux

If I have a folder of text files, how can I get the average words per file using Bash commands?
I know that I can use wc -w to get words per file, but I'm unsure of how to get the total number of words across all files, and then divide that number by the number of text files.

This recursively traverses the filesystem and counts all words and files. In the end it divides the total number of words by the number of files:
find . -type f -exec wc -w {} \; | awk '{numfiles=numfiles+1;total += $1} END{print total/numfiles}'

You can get total word count by:
cat *.txt | wc -w
and file number by:
ls *.txt | wc -l
Then you can divide them.
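For example, combining the two in one line (a sketch; it assumes the .txt files are in the current directory and uses awk so the result isn't truncated to an integer):
echo "$(cat *.txt | wc -w) $(ls *.txt | wc -l)" | awk '{print $1 / $2}'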

Just a piece of advice: you might use loops and variable assignment.
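A minimal sketch of that approach (assumes the files are *.txt in the current directory; integer division):
total=0 count=0
for f in *.txt; do
    words=$(wc -w < "$f")       # word count of this file only
    total=$((total + words))
    count=$((count + 1))
done
echo $((total / count))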

Huang's solution is very good, but it will emit errors for any directories it encounters. And division is a bit of a pain, since all arithmetic in the shell is done with integers. Here's a script that does what you want:
#!/bin/sh
total=0 count=0
for file in *; do
    test -f "$file" || continue    # skip directories and other non-regular files
    c=$( wc -w "$file" | awk '{print $1}' )
    : $(( total += c ))
    : $(( count += 1 ))
done
# dc does the division with 10 digits of precision; sed trims trailing zeros
echo $total $count 10k / p | dc | sed 's/0*$//'
But bunting's awk solution is the way to go.


Suppressing summary information in `wc -l` output

I use the command wc -l to count the number of lines in my text files (I also want to sort everything through a pipe), like this:
wc -l $directory-path/*.txt | sort -rn
The output includes "total" line, which is the sum of lines of all files:
10 total
5 ./directory/1.txt
3 ./directory/2.txt
2 ./directory/3.txt
Is there any way to suppress this summary line? Or even better, to change the way the summary line is worded? For example, instead of "10", the word "lines" and instead of "total" the word "file".
Yet another sed solution!
1. Short and quick
As the total comes on the last line, $d is the sed command for deleting the last line.
wc -l $directory-path/*.txt | sed '$d'
2. with header line addition:
wc -l $directory-path/*.txt | sed '$d;1ilines total'
Unfortunately, there is no alignment.
3. With alignment: formatting left column at 11 char width.
wc -l $directory-path/*.txt |
sed -e '
s/^ *\([0-9]\+\)/ \1/;
s/^ *\([0-9 ]\{11\}\) /\1 /;
/^ *[0-9]\+ total$/d;
1i\ lines filename'
This will do the job:
lines file
5 ./directory/1.txt
3 ./directory/2.txt
2 ./directory/3.txt
4. But what if your wc version really could put the total on the 1st line?
This one is for fun, because I don't believe there is a wc version that puts the total on the 1st line, but...
This version drops the total line wherever it appears and adds a header line at the top of the output.
wc -l $directory-path/*.txt |
sed -e '
s/^ *\([0-9]\+\)/ \1/;
s/^ *\([0-9 ]\{11\}\) /\1 /;
1{
/^ *[0-9]\+ total$/ba;
bb;
:a;
s/^.*$/ lines file/
};
bc;
:b;
1i\ lines file' -e '
:c;
/^ *[0-9]\+ total$/d
'
This is more complicated because we must not blindly drop the 1st line: if it happens to be the total line it is replaced by the header instead of simply being deleted.
This is actually fairly tricky.
I'm basing this on the GNU coreutils version of the wc command. Note that the total line is normally printed last, not first (see my comment on the question).
wc -l prints one line for each input file, consisting of the number of lines in the file followed by the name of the file. (The file name is omitted if there are no file name arguments; in that case it counts lines in stdin.)
If and only if there's more than one file name argument, it prints a final line containing the total number of lines and the word total. The documentation indicates no way to inhibit that summary line.
Other than the fact that it's preceded by other output, that line is indistinguishable from output for a file whose name happens to be total.
So to reliably filter out the total line, you'd have to read all the output of wc -l, and remove the final line only if the total length of the output is greater than 1. (Even that can fail if you have files with newlines in their names, but you can probably ignore that possibility.)
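To make the ambiguity concrete, here is a hypothetical run against two files, one of which happens to be named total (the counts are invented for illustration):
$ wc -l report.txt total
  7 report.txt
  3 total
 10 total
The last two lines can only be told apart by their position in the output.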
A more reliable method is to invoke wc -l on each file individually, avoiding the total line:
for file in $directory-path/*.txt ; do wc -l "$file" ; done
And if you want to sort the output (something you mentioned in a comment but not in your question):
for file in $directory-path/*.txt ; do wc -l "$file" ; done | sort -rn
If you happen to know that there are no files named total, a quick-and-dirty method is:
wc -l $directory-path/*.txt | grep -v ' total$'
If you want to run wc -l on all the files and then filter out the total line, here's a bash script that should do the job. Adjust the *.txt as needed.
#!/bin/bash
wc -l *.txt > .wc.out
lines=$(wc -l < .wc.out)
if [[ $lines -eq 1 ]] ; then
    # only one file was counted, so wc printed no "total" line
    cat .wc.out
else
    (( lines-- ))
    head -n $lines .wc.out    # everything except the final "total" line
fi
rm .wc.out
Another option is this Perl one-liner:
wc -l *.txt | perl -e '@lines = <>; pop @lines if scalar @lines > 1; print @lines'
@lines = <> slurps all the input into an array of strings. pop @lines discards the last line if there is more than one, i.e., if the last line is the total line.
The wc program always displays the total when there are two or more files (fragment of wc.c):
if (argc > 2)
report ("total", total_ccount, total_wcount, total_lcount);
return 0;
So the easiest approach is to use wc on only one file at a time, and have find present the files to wc one after the other:
find $dir -name '*.txt' -exec wc -l {} \;
Or as specified by liborm.
dir="."
find $dir -name '*.txt' -exec wc -l {} \; | sort -rn | sed 's/\.txt$//'
This is a job tailor-made for head:
wc -l | head --lines=-1
This way, everything still runs in a single pipeline; --lines=-1 ("all but the last line") requires GNU head.
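Applied to the question's files it might look like this (a sketch; note that the total line has to be stripped before sorting, because sort -rn would move it to the top):
wc -l $directory-path/*.txt | head --lines=-1 | sort -rn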
Can you use another wc ?
The POSIX wc(man -s1p wc) shows
If more than one input file operand is specified, an additional line shall be written, of the same format as the other lines, except that the word total (in the POSIX locale) shall be written instead of a pathname and the total of each column shall be written as appropriate. Such an additional line, if any, is written at the end of the output.
You said the total line was the first line; the manual states it's the last, and other wc implementations don't show it at all. Removing the first or last line is dangerous, so I would grep -v the line with the total (in the POSIX locale...), or just grep for the slash that's part of all the other lines:
wc -l $directory-path/*.txt | grep "/"
Not the most optimized way, since you can use combinations of cat, echo, coreutils, awk, sed, tac, etc., but this will get you what you want:
wc -l ./*.txt | awk 'BEGIN{print "Line\tFile"}1' | sed '$d'
wc -l ./*.txt produces the line counts. awk 'BEGIN{print "Line\tFile"}1' adds the header titles; the trailing 1 is an always-true pattern that makes awk print every input line. sed '$d' prints all lines except the last one.
Example Result
Line File
6 ./test1.txt
1 ./test2.txt
You can solve it (and many other problems that appear to need a for loop) quite succinctly using GNU Parallel like this:
parallel wc -l ::: tmp/*txt
Sample Output
3 tmp/lines.txt
5 tmp/unfiltered.txt
42 tmp/file.txt
6 tmp/used.txt
The simplicity of using just grep -c
I rarely use wc -l in my scripts because of these issues. I use grep -c instead. Though it is not as efficient as wc -l, we don't need to worry about other issues like the summary line, white space, or forking extra processes.
For example:
/var/log# grep -c '^' *
alternatives.log:0
alternatives.log.1:3
apache2:0
apport.log:160
apport.log.1:196
apt:0
auth.log:8741
auth.log.1:21534
boot.log:94
btmp:0
btmp.1:0
<snip>
Very straightforward for a single file:
line_count=$(grep -c '^' my_file.txt)
Performance comparison: grep -c vs wc -l
/tmp# ls -l *txt
-rw-r--r-- 1 root root 721009809 Dec 29 22:09 x.txt
-rw-r----- 1 root root 809338646 Dec 29 22:10 xyz.txt
/tmp# time grep -c '^' *txt
x.txt:7558434
xyz.txt:8484396
real 0m12.742s
user 0m1.960s
sys 0m3.480s
/tmp/# time wc -l *txt
7558434 x.txt
8484396 xyz.txt
16042830 total
real 0m9.790s
user 0m0.776s
sys 0m2.576s
Similar to Mark Setchell's answer you can also use xargs with an explicit separator:
ls | xargs -I% wc -l %
With -I, xargs doesn't send all the inputs to wc at once, but one operand at a time, so wc never prints a total line.
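If the file names may contain whitespace, a null-delimited variant of the same idea could look like this (a sketch; -n1 makes xargs run wc once per file, so again no total line appears):
find . -maxdepth 1 -name '*.txt' -print0 | xargs -0 -n1 wc -l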
Shortest answer:
ls | xargs -l wc
What about using sed with pattern deletion, as below? It removes the total line if it is present (but also the line for any file whose name contains total).
wc -l $directory-path/*.txt | sort -rn | sed '/total/d'

Find a maximum and minimum in subdirectory (less than 10 seconds with a shell script)

I have several files of this type:
Sensor Location Temp Threshold
------ -------- ---- ---------
#1 PROCESSOR_ZONE 23C/73F 62C/143F
#2 CPU#1 30C/86F 73C/163F
#3 I/O_ZONE 32C/89F 68C/154F
#4 CPU#2 22C/71F 73C/163F
#5 POWER_SUPPLY_BAY 17C/62F 55C/131F
There are approximately 124630 of these files in several sub-directories.
I am trying to determine the maximum and minimum temperature of PROCESSOR_ZONE.
Here is my script at the moment:
#!/bin/bash
max_value=0
min_value=50
find $1 -name hp-temps.txt -exec grep "PROCESSOR_ZONE" {} + | sed -e 's/\ \+/,/g' | cut -d, -f3 | cut -dC -f1 | while read current_value ;
do
echo $current_value;
done
output after my script:
30
28
26
23
...
My script is not finished yet, and it takes 10 minutes to show all the temperatures.
I think that to get there, I have to put the result of my command in a file, sort it, and get back the first line (the max) and the last one (the min), but I do not know how to do it.
Instead of this bit:
... | while read current_value ;
do
echo $current_value;
done
just direct the output after cut to a file:
... > temperatures.txt
If you need them sorted, sort them first:
... | sort -n > temperatures.txt
Then the first line of the file will be the minimum temperature returned, and the last line will be the maximum temperature.
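For example, pulling the two values back out of the sorted file (a sketch, reusing the temperatures.txt name from above):
min_value=$(head -n 1 temperatures.txt)
max_value=$(tail -n 1 temperatures.txt)
echo "MIN: $min_value C  MAX: $max_value C"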
Performance suggestion:
This find command runs a new grep process on each file. If there are hundreds of thousands of these files in your directory, it will run grep hundreds of thousands of times. You can speed it up by telling find to run a grep command once for each batch of a few thousand files:
find $1 -name hp-temps.txt -print | xargs grep -h "PROCESSOR_ZONE" | sed ...
The find command prints the filenames out on standard output; the xargs command reads these in and runs grep on a batch of files at once. The -h option to grep means "don't include the filename in the output".
Running it this way should speed up your search considerably if there are thousands of files to be processed.
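Putting the suggestions together, the whole job might be reduced to a single pipeline (a sketch; sed -n '1p;$p' prints only the first and last lines of the numerically sorted values, i.e. the minimum and the maximum):
find $1 -name hp-temps.txt -print | xargs grep -h "PROCESSOR_ZONE" | sed -e 's/\ \+/,/g' | cut -d, -f3 | cut -dC -f1 | sort -n | sed -n '1p;$p'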
If your script is slow, you might want to analyse first which command is slow. Using find with a lot of files on, for example, Windows/Cygwin is going to be slow.
Perl is a perfect match for your problem:
find $1 -name hp-temps.txt -exec perl -ne '/PROCESSOR_ZONE\s+(\d+)C/ and print "$1\n"' {} +
In this way you do a (Perl) regular expression match on a lot of files at the same time. The brackets match the temperature digits (\d+) and $1 references that. The and makes sure print only is executed if the match succeeds.
You could even consider using opendir and readdir to recursively descend into directories in Perl to get rid of the find, but it is not going to be faster.
To get the minimum and maximum values:
find $1 -name hp-temps.txt -exec perl -ne 'if (/PROCESSOR_ZONE\s+(\d+)C/){ $min=$1 if $1<$min or $min == undef; $max=$1 if $1>$max }; sub END { print "$min - $max\n" }' {} +
With 100k+ lines of output on your terminal this should save quite a bit of time.
#!/bin/bash
max_value=0
min_value=50
find $1 -name file.txt -exec grep "PROCESSOR_ZONE" {} + | sed -e 's/\ \+/,/g' | cut -d, -f3 | cut -dC -f1 |
{
    while read current_value ; do
        # For maximum
        if [[ $current_value -gt $max_value ]]; then
            max_value=$current_value
        fi
        # For minimum
        if [[ $current_value -lt $min_value ]]; then
            min_value=$current_value
            echo "new min $min_value"
        fi
    done
    echo "NEW MAX : $max_value °C"
    echo "NEW MIN : $min_value °C"
}

File with the most lines in a directory NOT bytes

I'm trying to wc -l an entire directory and then display the filename in an echo with the number of lines.
To add to my frustration, the directory has to come from a passed argument. So without looking stupid, can someone first tell me why a simple wc -l $1 doesn't give me the line count for the directory I pass as the argument? I know I'm not understanding it completely.
On top of that I need validation too: whether the argument given is not a directory, or whether there is more than one argument.
wc works on files rather than directories so, if you want the line count for all files in the directory, you would start with:
wc -l $1/*
With various gyrations to get rid of the total, sort it and extract only the largest, you could end up with something like (split across multiple lines for readability but should be entered on a single line):
pax> wc -l $1/* 2>/dev/null
| grep -v ' total$'
| sort -n -k1
| tail -1l
2892 target_dir/big_honkin_file.txt
As to the validation, you can check the number of parameters passed to your script with something like:
if [[ $# -ne 1 ]] ; then
echo 'Whoa! Wrong parameter count'
exit 1
fi
and you can check if it's a directory with:
if [[ ! -d $1 ]] ; then
echo 'Whoa!' "[$1]" 'is not a directory'
exit 1
fi
Is this what you want?
> find ./test1/ -type f|xargs wc -l
1 ./test1/firstSession_cnaiErrorFile.txt
77 ./test1/firstSession_cnaiReportFile.txt
14950 ./test1/exp.txt
1 ./test1/test1_cnaExitValue.txt
15029 total
so your directory which is the argument should go here:
find $your_complete_directory_path/ -type f|xargs wc -l
I'm trying to to wc -l an entire directory and then display the
filename in an echo with the number of lines.
You can do a find on the directory and use -exec option to trigger wc -l. Something like this:
$ find ~/Temp/perl/temp/ -exec wc -l '{}' \;
wc: /Volumes/Data/jaypalsingh/Temp/perl/temp/: read: Is a directory
11 /Volumes/Data/jaypalsingh/Temp/perl/temp//accessor1.plx
25 /Volumes/Data/jaypalsingh/Temp/perl/temp//autoincrement.pm
12 /Volumes/Data/jaypalsingh/Temp/perl/temp//bless1.plx
14 /Volumes/Data/jaypalsingh/Temp/perl/temp//bless2.plx
22 /Volumes/Data/jaypalsingh/Temp/perl/temp//classatr1.plx
27 /Volumes/Data/jaypalsingh/Temp/perl/temp//classatr2.plx
7 /Volumes/Data/jaypalsingh/Temp/perl/temp//employee1.pm
18 /Volumes/Data/jaypalsingh/Temp/perl/temp//employee2.pm
26 /Volumes/Data/jaypalsingh/Temp/perl/temp//employee3.pm
12 /Volumes/Data/jaypalsingh/Temp/perl/temp//ftp.plx
14 /Volumes/Data/jaypalsingh/Temp/perl/temp//inherit1.plx
16 /Volumes/Data/jaypalsingh/Temp/perl/temp//inherit2.plx
24 /Volumes/Data/jaypalsingh/Temp/perl/temp//inherit3.plx
33 /Volumes/Data/jaypalsingh/Temp/perl/temp//persisthash.pm
Nice question!
I saw the answers. Some are pretty good. The find ...|xargs one is my favourite. It could be simplified anyway using the find ... -exec wc -l {} + syntax. But there is a problem: when the command-line buffer is full, a wc -l ... is invoked, and each invocation prints its own <number> total line. As wc has no option to disable this feature, wc has to be reimplemented; filtering these lines out with grep is not nice.
So my complete answer is
#!/usr/bin/bash
[ $# -ne 1 ] && echo "Bad number of args">&2 && exit 1
[ ! -d "$1" ] && echo "Not dir">&2 && exit 1
find "$1" -type f -exec awk '{++n[FILENAME]}END{for(i in n) printf "%8d %s\n",n[i],i}' {} +
Or using less temporary space, but a little bit larger code in awk:
find "$1" -type f -exec awk 'function pr(){printf "%8d %s\n",n,f}FNR==1{f&&pr();n=0;f=FILENAME}{++n}END{pr()}' {} +
Misc
If it should not be called for subdirectories, then add -maxdepth 1 before -type in the find call.
It is pretty fast. I was afraid that it would be much slower than the find ... wc + version, but for a directory containing 14770 files (in several subdirs) the wc version ran in 3.8 sec and the awk version in 5.2 sec.
awk and wc treat lines that are not \n-terminated differently: a final line with no trailing \n is not counted by wc. I prefer to count it, as awk does; a small demonstration follows this list.
It does not print empty files.
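A quick demonstration of that counting difference on input whose last line has no trailing newline:
$ printf 'one\ntwo' | wc -l
1
$ printf 'one\ntwo' | awk 'END{print NR}'
2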
To find the file with most lines in the current directory and its subdirectories, with zsh:
lines() REPLY=$(wc -l < "$REPLY")
wc -l -- **/*(D.nO+lines[1])
That defines a lines function which is going to be used as a glob sorting function that returns in $REPLY the number of lines of the file whose path is given in $REPLY.
Then we use zsh's recursive globbing **/* to find regular files (.), numerically (n) reverse sorted (O) with the lines function (+lines), and select the first one [1]. (D to include dotfiles and traverse dotdirs).
Doing it with standard utilities is a bit tricky if you don't want to make assumptions on what characters file names may contain (like newline, space...). With GNU tools as found on most Linux distributions, it's a bit easier as they can deal with NUL terminated lines:
find . -type f -exec sh -c '
for file do
size=$(wc -l < "$file") &&
printf "%s\0" "$size:$file"
done' sh {} + |
tr '\n\0' '\0\n' |
sort -rn |
head -n1 |
tr '\0' '\n'
Or with zsh or GNU bash syntax:
biggest= max=-1
find . -type f -print0 |
{
while IFS= read -rd '' file; do
size=$(wc -l < "$file") &&
((size > max)) &&
max=$size biggest=$file
done
[[ -n $biggest ]] && printf '%s\n' "$max: $biggest"
}
Here's one that works for me with Git Bash (mingw32) under Windows:
find . -type f -print0| xargs -0 wc -l
This will list the files and line counts in the current directory and subdirectories. You can also direct the output to a text file and import it into Excel if needed:
find . -type f -print0| xargs -0 wc -l > fileListingWithLineCount.txt

Linux Shell - String manipulation then calculating age of file in minutes

I am writing a script that calculates the age of the oldest file in a directory. The first commands run are:
OLDFILE=`ls -lt $DIR | grep "^-" | tail -1 `
echo $OLDFILE
The output contains a lot more than just the filename. eg
-rwxrwxr-- 1 abc abc 334 May 10 2011 ABCD_xyz20110510113817046.abc.bak
Q1/. How do I obtain the output after the last space of the above line? This would give me the filename. I realise some sort of string manipulation is required but am new to shell scripting.
Q2/. How do I obtain the age of this file in minutes?
To obtain just the oldest file's name,
ls -lt | awk '/^-/{file=$NF}END{print file}'
However, this is not robust if you have files with spaces in their names, etc. Generally, you should try to avoid parsing the output from ls.
With stat you can obtain a file's modification date in machine-readable format, expressed as seconds since Jan 1, 1970; with date +%s you can obtain the current time in the same format. Subtract and divide by 60. (More awk skills would come in handy for the arithmetic.)
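A sketch of that calculation (assumes GNU stat, whose %Y format prints the modification time in seconds since the epoch; $oldfile is a placeholder for the file name found above):
now=$(date +%s)
mtime=$(stat -c %Y "$oldfile")
echo $(( (now - mtime) / 60 ))    # age in whole minutes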
Finally, for an alternate solution, look at the options for find; in particular, its printf format strings allow you to extract a file's age. The following will directly get you the age in seconds and inode number of the oldest file:
find . -maxdepth 1 -type f -printf '%T@ %i\n' |
sort -n | head -n 1
Using the inode number avoids the issues of funny file names; once you have a single inode, converting that to a file name is a snap:
find . -maxdepth 1 -inum "$number"
Tying the two together, you might want something like this:
# set -- Replace $# with output from command
set -- $(find . -maxdepth 1 -type f -printf '%T@ %i\n' |
sort -n | head -n 1)
# now $1 is the timestamp and $2 is the inode
oldest_filename=$(find . -maxdepth 1 -inum "$2")
age_in_minutes=$(date +%s | awk -v d="$1" '{ print ($1 - d) / 60 }')
An awk solution, giving you how old the file is in minutes (as your ls output does not contain the minutes of creation, 00 is assumed by default). Also, as tripleee pointed out, ls output is inherently risky to parse.
[[bash_prompt$]]$ echo $l; echo "##############";echo $l | awk -f test.sh ; echo "#############"; cat test.sh
-rwxrwxr-- 1 abc abc 334 May 20 2013 ABCD_xyz20110510113817046.abc.bak
##############
File ABCD_xyz20110510113817046.abc.bak is 2074.67 min old
#############
BEGIN{
m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
for(o=1;o<=m;o++){
date[d[o]]=sprintf("%02d",o)
}
}
{
month=date[$6];
old_time=mktime($8" "month" "$7" "00" "00" "00);
curr_time=systime();
print "File " $NF " is " (curr_time-old_time)/60 " min old";
}
For Q2 a bash one-liner could be:
let mtime_minutes=\(`date +%s`-`stat -c%Y "$file_to_inspect"`\)/60
You probably want ls -t. Including the '-l' option explicitly asks for ls to give you all that other info that you actually want to get rid of.
However what you probably REALLY want is find, first:
NEWEST=$(find . -maxdepth 1 -type f | cut -c3- |xargs ls -t| head -1)
this will get you the plain filename of the newest item, no other processing needed.
Directories will be correctly excluded. (thanks to tripleee for pointing out that's what you were aiming for.)
For the second question, you can use date, stat and bc:
TIME_MINUTES=$(echo "( $(date +%s) - $(stat --format %Y "$NEWEST") ) / 60" | bc -l)
remove the '-l' option from bc if you only want a whole number of minutes.

How to find file in each directory with the highest number as filename?

I have a file structure that looks like this
./501.res/1.bin
./503.res/1.bin
./503.res/2.bin
./504.res/1.bin
and I would like to find the file path to the .bin file in each directory which have the highest number as filename. So the output I am looking for would be
./501.res/1.bin
./503.res/2.bin
./504.res/1.bin
The highest number a file can have is 9.
Question
How do I do that in BASH?
I have come as far as find .|grep bin|sort
Globs are guaranteed to be expanded in lexical order.
for dir in ./*/
do
files=($dir/*) # create an array
echo "${files[#]: -1}" # access its last member
done
What about using awk? You can get the FIRST occurrence really simply:
[ghoti@pc ~]$ cat data1
./501.res/1.bin
./503.res/1.bin
./503.res/2.bin
./504.res/1.bin
[ghoti@pc ~]$ awk 'BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1' data1
./501.res/1.bin
./503.res/1.bin
./504.res/1.bin
[ghoti@pc ~]$
To get the last occurrence you could pipe through a couple of sorts:
[ghoti@pc ~]$ sort -r data1 | awk 'BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1' | sort
./501.res/1.bin
./503.res/2.bin
./504.res/1.bin
[ghoti@pc ~]$
Given that you're using "find" and "grep", you could probably do this:
find . -name \*.bin -type f -print | sort -r | awk 'BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1' | sort
How does this work?
The find command has many useful options, including the ability to select your files by glob, select the type of file, etc. Its output you already know, and that becomes the input to sort -r.
First, we sort our input data in reverse (sort -r). This insures that within any directory, the highest numbered file will show up first. That result gets fed into awk. FS is the field separator, which makes $2 into things like "/501", "/502", etc. Awk scripts have sections in the form of condition {action} which get evaluated for each line of input. If a condition is missing, the action runs on every line. If "1" is the condition and there is no action, it prints the line. So this script is broken out as follows:
a[$2] {next} - If the array a with the subscript $2 (i.e. "/501") exists, just jump to the next line. Otherwise...
{a[$2]=1} - set the array a subscript $2 to 1, so that in future the first condition will evaluate as true, then...
1 - print the line.
The output of this awk script will be the data you want, but in reverse order. The final sort puts things back in the order you'd expect.
Now ... that's a lot of pipes, and sort can be a bit resource hungry when you ask it to deal with millions of lines of input at the same time. This solution will be perfectly sufficient for small numbers of files, but if you're dealing with large quantities of input, let us know, and I can come up with an all-in-one awk solution (that will take longer than 60 seconds to write).
UPDATE
Per Dennis' sage advice, the awk script I included above could be improved by changing it from
BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1
to
BEGIN{FS="."} $2 in a {next} {a[$2]} 1
While this is functionally identical, the advantage is that you simply define array members rather than assigning values to them, which may save memory or cpu depending on your implementation of awk. At any rate, it's cleaner.
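For reference, the full find pipeline from above with the improved script substituted in would read:
find . -name \*.bin -type f -print | sort -r | awk 'BEGIN{FS="."} $2 in a {next} {a[$2]} 1' | sort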
Tested:
find . -type d -name '*.res' | while read dir; do
find "$dir" -maxdepth 1 | sort -n | tail -n 1
done
I came up with something like this:
for dir in $(find . -mindepth 1 -type d | sort); do
file=$(ls "$dir" | sort | tail -n 1);
[ -n "$file" ] && (echo "$dir/$file");
done
Maybe it can be simpler
If invoking a shell from within find is an option, try this:
find * -type d -exec sh -c "echo -n './'; ls -1 {}/*.bin | sort -n -r | head -n 1" \;
And here is a one-liner:
find . -mindepth 1 -type d | sort | sed -e "s/.*/ls & | sort | tail -n 1 | xargs -I{} echo &\/{}/" | bash
