Performance of wc -l - Linux

I ran the following command:
time for i in {1..100}; do find / -name "*.service" | wc -l; done
got 100 lines of output, then:
real 0m35.466s
user 0m15.688s
sys 0m14.552s
I then ran the following command:
time for i in {1..100}; do find / -name "*.service" | awk 'END{print NR}'; done
got 100 lines of output, then:
real 0m35.036s
user 0m15.848s
sys 0m14.056s
To be clear, I had already run find / -name "*.service" just before, so the results were cached for both commands.
I expected wc -l to be faster. Why isn't it?

Others have mentioned that you're probably timing find, not wc or awk. Still, there may be interesting differences to explore between wc and awk in their various flavors.
Here are the results I get:
Mac OS 10.10.5 awk: 0.16 million lines/second
GNU awk/gawk 4.1.4: 4.4 million lines/second
Mac OS 10.10.5 wc: 6.8 million lines/second
GNU wc 8.27: 11 million lines/second
I didn't use find, but instead ran wc -l or awk 'END{print NR}' on a large text file (66k lines) in a loop.
I varied the order of the commands and didn't find any deviations large enough to change the rankings I reported.
LC_CTYPE=C had no measurable effect on any of these.
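For reference, here is a sketch of the kind of loop I mean (bigfile.txt is a stand-in name for the 66k-line test file):
time for i in {1..100}; do wc -l bigfile.txt > /dev/null; done
time for i in {1..100}; do awk 'END{print NR}' bigfile.txt > /dev/null; done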
Conclusions:
Don't use the macOS built-in command-line tools except for trivial amounts of data.
GNU wc is faster than GNU awk at counting lines.
I use MacPorts GNU binaries. It would be interesting to see how Homebrew binaries compare. (I'm guessing they'd lose.)

Three things:
Such a small difference is usually not significant:
0m35.466s - 0m35.036s = 0m0.43s or 1.2%
Yet wc -l is in fact faster (about 10x) than awk 'END{print NR}':
% time seq 100000000 | awk 'END{print NR}' > /dev/null
real 0m13.624s
user 0m14.656s
sys 0m1.047s
% time seq 100000000 | wc -l > /dev/null
real 0m1.604s
user 0m2.413s
sys 0m0.623s
My guess is that the disk cache (page cache) holds the find results, so after the first run with wc -l, most of the reads needed by find come from the cache. Presumably the difference between the initial find with disk reads and a later find served from the cache is greater than the difference in run time between awk and wc.
One way to test this is to reboot, which clears the disk cache, then run the two tests again in the reverse order, so that awk runs first. I'd expect the first-run awk to be even slower than the first-run wc was, and the second-run wc to be faster than the second-run awk was.
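On Linux you can also drop the page cache without a full reboot (this needs root and is Linux-specific); a rough sketch of that test:
# flush dirty pages, then drop the page/dentry/inode caches
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
# cold-cache run, awk first this time
time find / -name "*.service" | awk 'END{print NR}'
# warm-cache run afterwards
time find / -name "*.service" | wc -l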

Related

Total number of lines in a directory

I have a directory with thousands of files (100K for now). When I use wc -l ./*, I'll get:
c1 ./test1.txt
c2 ./test2.txt
...
cn ./testn.txt
c1+c2+...+cn total
Because there are a lot of files in the directory, I just want to see the total count and not the details. Is there any way to do so?
I tried several ways and got the following error:
Argument list too long
If what you want is the total number of lines and nothing else, then I would suggest the following command:
cat * | wc -l
This concatenates the contents of all of the files in the current working directory and pipes the resulting blob of text through wc -l.
I find this to be quite elegant. Note that the command produces no extraneous output.
UPDATE:
I didn't realize your directory contained so many files. In light of this information, you should try this command:
for file in *; do cat "$file"; done | wc -l
Most people don't know that you can pipe the output of a for loop directly into another command.
Beware that this could be very slow. If you have 100,000 or so files, my guess would be around 10 minutes. This is a wild guess because it depends on several parameters that I'm not able to check.
If you need something faster, you should write your own utility in C. You could make it surprisingly fast if you use pthreads.
Hope that helps.
LAST NOTE:
If you're interested in building a custom utility, I could help you code one up. It would be a good exercise, and others might find it useful.
Credit: this builds on @lifecrisis's answer, and extends it to handle large numbers of files:
find . -maxdepth 1 -type f -exec cat {} + | wc -l
find will find all of the files in the current directory, break them into groups as large as can be passed as arguments, and run cat on the groups.
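For comparison, the same grouping idea can be written with xargs; this is only a sketch and assumes your find and xargs support the -print0/-0 extensions (GNU and BSD versions do), which keeps file names with spaces or newlines safe:
find . -maxdepth 1 -type f -print0 | xargs -0 cat | wc -l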
awk 'END {print NR" total"}' ./*
Would be an interesting comparison to find out how many lines don't end with a new line.
Combining the awk and Gordon's find solutions, while avoiding the "." (hidden) files:
find ./* -maxdepth 0 -type f -exec awk 'END {print NR}' {} +
No idea if this is better or worse but it does give a more accurate count (for me) and does not count lines in "." files. Using ./* is just a guess that appears to work.
A depth is still needed, and ./* requires a depth of 0.
I did get the same result with the "cat" and "awk" solutions (using the same find), since "cat *" takes care of the newline issue. I don't have a directory with enough files to time them meaningfully. Interesting; I'm liking the "cat" solution.
This will give you the total count for all the files (including hidden files) in your current directory:
$ find . -maxdepth 1 -type f | xargs wc -l | grep total
1052 total
To count files excluding hidden files, use:
find . -maxdepth 1 -type f -not -path "*/\.*" | xargs wc -l | grep total
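One caveat, and a sketch that is not part of the original answer: with enough files, xargs may invoke wc more than once, so grep total can print several partial totals. Assuming -print0/-0 support again, you can instead sum the per-file counts yourself:
# skip wc's "total" lines and add up the per-file counts
find . -maxdepth 1 -type f -print0 | xargs -0 wc -l | awk '$2 != "total" {sum += $1} END {print sum}'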
(Apologies for adding this as an answer, but I do not have enough reputation to comment.)
A comment on @lifecrisis's answer: perhaps cat is slowing things down a bit. We could replace cat with wc -l and then use awk to add up the numbers. (This could be faster since much less data needs to go through the pipe.)
That is
for file in *; do wc -l "$file"; done | awk '{sum += $1} END {print sum}'
instead of
for file in *; do cat "$file"; done | wc -l
(Disclaimer: I am not incorporating many of the improvements in other answers, but I thought the point was valid enough to write down.)
Here are my results for comparison (I ran the newer version first so that any cache effects would go against the newer candidate).
$ time for f in `seq 1 1500`; do head -c 5M </dev/urandom >myfile-$f |sed -e 's/\(................\)/\1\n/g'; done
real 0m50.360s
user 0m4.040s
sys 0m49.489s
$ time for file in myfile-*; do wc -l "$file"; done | awk '{sum += $1} END {print sum}'
30714902
real 0m3.455s
user 0m2.093s
sys 0m1.515s
$ time for file in myfile-*; do cat "$file"; done | wc -l
30714902
real 0m4.481s
user 0m2.544s
sys 0m4.312s
If you want to know only the number of files in the directory (i.e., the number of lines of ls output, excluding its "total" line):
ls -ltr | sed -n '/total/!p' | awk 'END{print NR}'
The command above only counts files. The commands below write each file's line count to a temporary file and then add them up, giving the total count of lines from all files in the path:
for i in `ls -ltr | awk '$1~"^-rw"{print $9}'`; do wc -l "$i" | awk '{print $1}'; done >> /var/tmp/filelinescount.txt
cat /var/tmp/filelinescount.txt | sed -r "s/\s+//g" | tr "\n" "+" | sed "s:+$::g" | sed 's/^/"/g' | sed 's/$/"/g' | awk '{print "echo " $0 " | bc"}' | sh
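The same sum can be written more compactly, assuming paste and bc are available:
paste -sd+ /var/tmp/filelinescount.txt | bc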

Optimizing search in Linux

I have a huge log file close to 3GB in size.
My task is to generate some reporting based on # of times something is being logged.
I need to find the number of times StringA, StringB, and StringC are logged, each counted separately.
What I am doing right now is:
grep "StringA" server.log | wc -l
grep "StringB" server.log | wc -l
grep "StringC" server.log | wc -l
This is a long process and my script takes close to 10 minutes to complete. What I want to know is whether this can be optimized. Is it possible to run one grep command and find out the number of times StringA, StringB, and StringC have been logged individually?
You can use grep -c instead of wc -l:
grep -c "StringA" server.log
However, grep can't report separate counts for individual strings in a single pass. You can use awk:
out=$(awk '/StringA/{a++;} /StringB/{b++;} /StringC/{c++;} END{print a, b, c}' server.log)
Then you can extract each count with a simple bash array:
arr=($out)
echo "StringA="${arr[0]}
echo "StringA="${arr[1]}
echo "StringA="${arr[2]}
This (grep -c without wc) is certainly going to be faster, and the awk solution may well be faster still, but I haven't measured either.
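If you would rather have named variables than an array, a small bash alternative using the same $out is:
read -r a b c <<< "$out"
echo "StringA=$a"
echo "StringB=$b"
echo "StringC=$c"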
This approach could certainly be optimized further, since grep doesn't perform any text indexing. I would use a text-indexing engine like one of those from this review or this Stack Exchange Q&A. You may also consider using journald from systemd, which stores logs in a structured and indexed format so lookups are more efficient.
So many greps so little time... :-)
According to David Lyness, a straight grep search is about 7 times as fast as awk for large-file searches.
If that is the case, the current approach could be optimized by changing grep to fgrep, but only if the patterns being searched for are not regular expressions. fgrep is optimized for fixed patterns.
If the number of matching entries is relatively small compared to the original log file, it may be an improvement to use the egrep version of grep to create a temporary file containing all three kinds of entries:
egrep "StringA|StringB|StringC" server.log > tmp.log
grep "StringA" tmp.log | wc -c
grep "StringB" tmp.log | wc -c
grep "StringC" tmp.log | wc -c
The egrep variant of grep allows a | (vertical bar/pipe) character between two or more separate search strings, so that you can look for multiple strings in one statement. You can use grep -E to do the same thing.
Full documentation is in the grep man page; information about the extended regular expressions that egrep uses is in man 7 re_format.

How to count the times a word appears in a file using a shell?

Given a file containing text, I would like to count the occurrences of a string like "ABCDXYZ".
$ cat file.txt
foo
bar
foo
bar
baz
baz
bug
bat
foo
bar
so
on
and
so
on
foo
Let's count foo!
Many times I see people using the following to count words:
$ grep -o 'foo' file.txt | wc -l
Here are a few examples: 1, 2, 3 and even this YouTube video.
This is really a bad way, for a few reasons:
It suggests you never read man grep, whether BSD grep (NetBSD, OpenBSD, FreeBSD) or GNU grep.
All of these implementations offer an option to count matches: -c.
The NetBSD man page describes this option very clearly:
-c, --count
Suppress normal output; instead print a count of matching lines
for each input file. With the -v, --invert-match option (see
below), count non-matching lines.
So you can use just one command:
$ grep foo -c file.txt
Not only could you, you should, and you'll save yourself a lot of time searching by reading man pages and understanding the tools you have at hand! (One note: -c counts matching lines rather than individual matches, which comes to the same thing here since foo appears at most once per line.)
Speed bonus
You can also make your greps faster, because pipes are quite expensive.
On the short file shown above, the pipe is about 2 times slower compared to using the -c option:
$ time grep foo -c file.txt
4
real 0m0.001s
user 0m0.000s
sys 0m0.001s
$ time grep -o 'foo' file.txt | wc -l
4
real 0m0.002s
user 0m0.000s
sys 0m0.003s
On large files this can be even more significant. Here I appended copies of my file to a larger one (interrupting the loop partway through):
$ for i in `seq 1 300000`; do cat file.txt >> largefile.txt; done
^C
$ wc -l largefile.txt
1111744 largefile.txt
Now here is how slow the pipe is:
$ time grep -o foo largefile.txt | wc -l
277936
real 0m0.216s
user 0m0.214s
sys 0m0.010s
And here is how fast grep alone is:
$ time grep -c foo largefile.txt
277936
real 0m0.032s
user 0m0.028s
sys 0m0.004s
These benchmarks were done on a machine with a Core i5 and plenty of RAM; the gap would have been significantly larger on an embedded device with little RAM and limited CPU resources.
To sum up, don't use pipes where you don't need them. UNIX tools often have overlapping functionality. Know your tools and read how to use them!
To count the occurrences of a word in a file, it's enough to use:
$ grep -c <word> <filename>
If you want to generalize and count the occurrences of every word (here, one word per line), use:
sort file.txt | uniq -c
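If the file can hold several words per line, a common sketch is to break it into one word per line first (sorted by frequency here):
tr -s '[:space:]' '\n' < file.txt | sort | uniq -c | sort -rn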

How to get grep -m1 to work in OSX

I have a script which worked perfectly fine on Linux, but now that I've switched over to a Mac, the script still runs but behaves slightly differently.
This is a script for tallying student attendance at departmental functions. We use a portable barcode scanner to scan their IDs and then save all scans in one CSV file per date.
I used grep -m1 $ID csvfolder/* | wc -l in the past to get a count of how many files their ID shows up in. The -m1 is necessary to make sure they don't get "extra credit" for repeatedly scanning in at the same event.
However, when I use this same command on the Mac, grep exits as soon as it has found the first match in the first file. So if a student shows up in 4 files, wc -l still returns 1.
How can I (without installing the GNU versions) emulate this feature?
I don't have Mac OS X handy to test with, but the following is POSIX-standard, AFAIK:
grep -l "$ID" csvfolder/* | wc -l
grep will print the name of each file that contains a match. That should work with GNU grep equally well.
You could alternatively use awk for this task:
awk -v id="$ID" '$0 ~ id {print 1; nextfile}' csvfolder/* | wc -l

Simpler way of extracting text from file

I've put together a batch script to generate panoramas using the command line tools used by Hugin. One interesting thing about several of those tools is they allow multi-core usage, but this option has to be flagged within the command.
What I've come up with so far:
#get the last fields of each line in the file, initialize the line counter
results=$(more /proc/cpuinfo | awk '{print ($NF)}')
count=0
#loop through the results till the 12th line for cpu core count
for result in $results; do
if [ $count == 12 ]; then
echo "Core Count: $result"
fi
count=$((count+1))
done
Is there a simpler way to do this?
result=$(awk 'NR==12{print $NF}' /proc/cpuinfo)
To answer your question about getting the first/last so-many lines, you could use head and tail, e.g.:
cat /proc/cpuinfo | awk '{print ($NF)}' | head -12 | tail -1
But instead of searching for the 12th line, how about searching semantically for any line containing "cores"? For example, some machines may have multiple physical CPUs, so you may want to sum the results:
cat /proc/cpuinfo | grep "cores" | awk '{s+=$NF} END {print s}'
count=$(getconf _NPROCESSORS_ONLN)
see getconf(1) and sysconf(3) constants.
According to the Linux man page, _SC_NPROCESSORS_ONLN "may not be standard". My guess is this requires glibc or even specifically a Linux system. If that doesn't work, I'd sooner look at /sys/class/cpuid (perhaps there's something better?) than parse /proc/cpuinfo. None of the above is completely portable.
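If GNU coreutils happens to be installed, nproc reports the same count of online processors directly:
count=$(nproc)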
There are many ways:
head -n 12 /proc/cpuinfo | tail -1 | awk -F: '{print $2}'
grep 'cpu cores' /proc/cpuinfo | head -1 | awk -F: '{print $2}'
and so on.
But I must note that you are only taking the information from the first section of /proc/cpuinfo, and I am not sure that that is what you need.
And what if cpuinfo changes its format? ;) Maybe something like this would be better:
cat /proc/cpuinfo | sed -n 's/cpu cores\s\+:\s\+\(.*\)/\1/p' | tail -n 1
And make sure to sum the cores. Mine has got like 12 or 16 of them ;)
I'm unsure what you are trying to do and why what ormaaj said above wouldn't work. My instinct, based on your description, would have been something much simpler, along the lines of:
grep processor /proc/cpuinfo | wc -l
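In the spirit of the earlier advice about unneeded pipes, grep can also produce the count by itself; anchoring the pattern avoids matching lines that merely contain the word:
grep -c '^processor' /proc/cpuinfo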
