How to count the times a word appears in a file using a shell? - linux

Given a file containing text, how do I count the occurrences of a string such as "ABCDXYZ"?
$ cat file.txt
foo
bar
foo
bar
baz
baz
bug
bat
foo
bar
so
on
and
so
on
foo
Let's count foo!

Many times I see people using the following to count words:
$ grep -o 'foo' file.txt | wc -l
Here are a few examples: 1, 2, 3 and even this YouTube video.
This is really a bad way, for a few reasons:
It suggests you never read man grep, whether BSD grep (NetBSD, OpenBSD, FreeBSD) or GNU grep.
All of these implementations offer an option to count matching lines: -c.
The NetBSD man page describes this option very clearly:
-c, --count
Suppress normal output; instead print a count of matching lines
for each input file. With the -v, --invert-match option (see
below), count non-matching lines.
You can use just one command:
$ grep foo -c file.txt
Not only can you, you should, and you'll save yourself a lot of time spent searching by reading the man pages and understanding the tools you have at hand!
Speed bonus
You can also make your greps faster, because pipes are quite expensive.
On the short file shown above, a pipe is about 2 times slower compared to using the -c option:
$ time grep foo -c file.txt
4
real 0m0.001s
user 0m0.000s
sys 0m0.001s
$ time grep -o 'foo' file.txt | wc -l
4
real 0m0.002s
user 0m0.000s
sys 0m0.003s
On large files this can be even more significant. Here I appended my file to a much larger one tens of thousands of times (I interrupted the loop partway through):
$ for i in `seq 1 300000`; do cat file.txt >> largefile.txt; done
^C
$ wc -l largefile.txt
1111744 largefile.txt
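As an aside, appending with >> reopens largefile.txt on every iteration; redirecting the whole loop once is a bit cleaner and faster. A sketch, with an arbitrary repeat count:
$ for i in `seq 1 65000`; do cat file.txt; done > largefile.txt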
Now here is how slow the piped version is:
$ time grep -o foo largefile.txt | wc -l
277936
real 0m0.216s
user 0m0.214s
sys 0m0.010s
And here is how fast grep alone is:
$ time grep -c foo largefile.txt
277936
real 0m0.032s
user 0m0.028s
sys 0m0.004s
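Timings in the millisecond range are noisy. If you want steadier numbers, one rough approach is to repeat each command many times and compare the totals; a sketch, with 100 iterations chosen arbitrarily:
$ time for i in `seq 1 100`; do grep -c foo largefile.txt; done > /dev/null
$ time for i in `seq 1 100`; do grep -o foo largefile.txt | wc -l; done > /dev/null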
These benchmarks were done on a machine with a Core i5 and plenty of RAM; the gap would be even more significant on an embedded device with little RAM and few CPU resources.
To sum up, don't use pipes where you don't need them. UNIX tools often have overlapping functionality. Know your tools, and read how to use them!
To count the occurrences of a word in a file, it's enough to use:
$ grep -c <word> <filename>

If you want to generalize to count all words, use:
sort file.txt | uniq -c
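Note that sort | uniq -c counts identical lines, which works here because the sample file has one word per line. If a line can hold several words, one way is to split out the words first, for example with GNU grep's -o (a sketch; it treats any run of alphanumeric characters as a word):
$ grep -oE '[[:alnum:]]+' file.txt | sort | uniq -c | sort -rn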

Related

Performance of wc -l

I ran the following command :
time for i in {1..100}; do find / -name "*.service" | wc -l; done
I got 100 lines of results, and then:
real 0m35.466s
user 0m15.688s
sys 0m14.552s
I then ran the following command :
time for i in {1..100}; do find / -name "*.service" | awk 'END{print NR}'; done
I got 100 lines of results, and then:
real 0m35.036s
user 0m15.848s
sys 0m14.056s
I should point out that I had already run find / -name "*.service" just before, so the results were cached for both commands.
I expected wc -l to be faster. Why is it not?
Others have mentioned that you're probably timing find, not wc or awk. Still, there may be interesting differences to explore between wc and awk in their various flavors.
Here are the results I get:
Mac OS 10.10.5 awk 0.16m lines/second
GNU awk/gawk 4.1.4 4.4m lines/second
Mac OS 10.10.5 wc 6.8m lines/second
GNU wc 8.27 11m lines/second
I didn't use find, but instead ran wc -l or awk 'END{print NR}' on a large text file (66k lines) in a loop.
I varied the order of the commands and didn't find any deviations large enough to change the rankings I reported.
LC_CTYPE=C had no measurable effect on any of these.
Conclusions
Don't use the macOS built-in command line tools except for trivial amounts of data.
GNU wc is faster than GNU awk at counting lines.
I use MacPorts GNU binaries. It would be interesting to see how Homebrew binaries compare (I'm guessing they'd lose).
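For anyone wanting to reproduce this kind of comparison, a minimal sketch (the 10-million-line test file and its contents are arbitrary):
yes 'just another line of text' | head -n 10000000 > big.txt
time wc -l < big.txt
time awk 'END{print NR}' big.txt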
Three things:
Such a small difference is usually not significant:
0m35.466s - 0m35.036s = 0m0.43s or 1.2%
Yet wc -l is faster (10x) than awk 'END{print NR}'.
% time seq 100000000 | awk 'END{print NR}' > /dev/null
real 0m13.624s
user 0m14.656s
sys 0m1.047s
% time seq 100000000 | wc -l > /dev/null
real 0m1.604s
user 0m2.413s
sys 0m0.623s
My guess is that the hard drive cache holds the find results, so after the first run with wc -l, most of the reads needed for find are in the cache. Presumably the difference in times between the initial find with disk reads and the second find with cache reads, would be greater than the difference in run times between awk and wc.
One way to test this is to reboot, which clears the hard disk cache, then run the two tests again, but in the reverse order, so that awk is run first. I'd expect that the first-run awk would be even slower than the first-run wc, and the second-run wc would be faster than the second-run awk.
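One way to take find out of the measurement entirely is to capture its output once and time only the counting step; a sketch (services.txt is an arbitrary name):
find / -name "*.service" 2>/dev/null > services.txt
time wc -l < services.txt
time awk 'END{print NR}' services.txt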

Optimizing search in Linux

I have a huge log file close to 3GB in size.
My task is to generate some reporting based on # of times something is being logged.
I need to find the number of times StringA, StringB, and StringC are each logged, separately.
What I am doing right now is:
grep "StringA" server.log | wc -l
grep "StringB" server.log | wc -l
grep "StringC" server.log | wc -l
This is a long process, and my script takes close to 10 minutes to complete. What I want to know is whether this can be optimized. Is it possible to run one grep command and find out the number of times StringA, StringB, and StringC have each been logged?
You can use grep -c instead of wc -l:
grep -c "StringA" server.log
grep can't report separate counts for individual strings in one pass. You can use awk:
out=$(awk '/StringA/{a++;} /StringB/{b++;} /StringC/{c++;} END{print a, b, c}' server.log)
Then you can extract each count with a simple bash array:
arr=($out)
echo "StringA="${arr[0]}
echo "StringB="${arr[1]}
echo "StringC="${arr[2]}
This (grep without wc) is certainly going to be faster, and the awk solution is possibly faster still, but I haven't measured either.
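If you'd rather skip the bash array step, the same awk pass can label the counts itself; a variant sketch, not benchmarked:
awk '/StringA/{a++} /StringB/{b++} /StringC/{c++} END{printf "StringA=%d\nStringB=%d\nStringC=%d\n", a, b, c}' server.log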
Certainly this approach could be optimized, since grep doesn't perform any text indexing. I would use a text indexing engine like one of those from this review or this Stack Exchange Q&A. You may also consider using journald from systemd, which stores logs in a structured and indexed format, so lookups are more efficient.
So many greps so little time... :-)
According to David Lyness, a straight grep search is about 7 times as fast as an awk in large file searches.
If that is the case, the current approach could be optimized by changing grep to fgrep, but only if the patterns being searched for are not regular expressions. fgrep is optimized for fixed patterns.
If the number of instances is relatively small compared to the original log file entries, it may be an improvement to use the egrep version of grep to create a temporary file filled with all three instances:
egrep "StringA|StringB|StringC" server.log > tmp.log
grep "StringA" tmp.log | wc -c
grep "StringB" tmp.log | wc -c
grep "StringC" tmp.log | wc -c
The egrep variant of grep allows a | (vertical bar/pipe) character between two or more separate search strings, so that you can find multiple strings in a single statement. You can use grep -E to do the same thing.
Full documentation is in the grep man page; information about the Extended Regular Expressions that egrep uses is in man 7 re_format.
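Another single-pass option, if per-occurrence rather than per-line counts are acceptable, is to extract just the matches and tally them; a sketch using GNU grep's -o:
grep -oE 'StringA|StringB|StringC' server.log | sort | uniq -c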

How do I grep multiple lines (output from another command) at the same time?

I have a Linux driver running in the background that is able to return the current system data/stats. I view the data by running a console utility (let's call it dump-data) in a console. All data is dumped every time I run dump-data. The output of the utility is like below
Output:
- A=reading1
- B=reading2
- C=reading3
- D=reading4
- E=reading5
...
- variableX=readingX
...
The list of readings returned by the utility can be really long. Depending on the scenario, certain readings would be useful while everything else would be useless.
I need a way to grep only the useful readings, whose names might have nothing in common (via a bash script). I.e., sometimes I'll need to collect A, D, E; and other times I'll need C, D, E.
I'm attempting to graph the readings over time to look for trends, so I can't run something like this:
# forgive my pseudocode
Loop
dump-data | grep A
dump-data | grep D
dump-data | grep E
End Loop
to collect A, D, and E, as that would actually give me readings from three separate calls of dump-data, which would not be accurate.
If you want to save all of grep's results in the same file, you can just join all the expressions into one:
grep -E 'expr1|expr2|expr3'
But if you want the results (for expr1, expr2, and expr3) in separate files, things get more interesting.
You can do this using tee >(command).
For example, here I process the same pipe with three different commands:
$ echo abc | tee >(sed s/a/_a_/ > file1) | tee >(sed s/b/_b_/ > file2) | sed s/c/_c_/ > file3
$ grep "" file[123]
file1:_a_bc
file2:a_b_c
file3:ab_c_
But the command seems too complex.
I would rather save the dump-data output to a file and then grep it:
TEMP=$(mktemp /tmp/dump-data-XXXXXXXX)
dump-data > ${TEMP}
grep A ${TEMP}
grep B ${TEMP}
grep C ${TEMP}
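Building on that, a sketch for the trend-graphing case that keeps only the wanted readings from each dump, tagged with a timestamp (the '- NAME=value' line format comes from the sample output above; readings.log and the epoch timestamp are assumptions):
TEMP=$(mktemp /tmp/dump-data-XXXXXXXX)
dump-data > "${TEMP}"
# record the sample time, then only the readings of interest
{ date +%s; grep -E '^- (A|D|E)=' "${TEMP}"; } >> readings.log
rm -f "${TEMP}"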
You can use dump-data | grep -E "A|D|E". Note the -E option of grep. Alternatively you could use egrep without the -E option.
You can simply use:
dump-data | grep -E 'A|D|E'
To write the lines matching a pattern from each input file into a corresponding matches-<filename> file:
awk '/MY PATTERN/{print > "matches-"FILENAME;}' myfile{1,3}
Thanks to Guru at Stack Exchange.

linux command grep -is "abc" filename|wc -l

What does the -s option mean there, and what is the pipe into wc for? I know the command eventually counts the number of times abc appears in the file filename, but I'm not sure what the -s option is for or what the pipe to wc means.
linux command grep -is "abc" filename|wc -l
output
47
-s means "suppress error messages about unreadable files" and the pipe to wc means "take the output and send it to the wc -l command" which effectively counts the number of lines matched. You can accomplish the same with the -c option to grep: grep -isc "abc" filename
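In other words, for the file in the question both of these should print the same count (47), since -c counts matching lines just as wc -l counts the lines grep printed:
grep -is "abc" filename | wc -l
grep -isc "abc" filename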
Consider:
command_1 | command_2
The role of the pipe is to take the output of the command written before it (command_1 here) and supply that output as input to the command written after it (command_2 here).
The man page has everything you would want to know about the options for grep:
-s, --no-messages
Suppress error messages about nonexistent or unreadable files.
Portability note: unlike GNU grep, traditional grep did not conform to
POSIX.2, because traditional grep lacked a -q option and its -s option
behaved like GNU grep's -q option. Shell scripts intended to be portable
to traditional grep should avoid both -q and -s and should redirect
output to /dev/null instead.
The pipe to wc -l is what gives you the count of how many lines the string "abc" appeared on. It isn't necessarily the number of times the string appeared in the file since one line with multiple occurrences is going to be counted as only 1.
grep man page says:
-s, --no-messages suppress error messages
grep returns the lines that have abc (case insensitive) in them. You pipe them to wc to get a count of the number of lines.
From man grep:
-s, --no-messages
Suppress error messages about nonexistent or unreadable files.
The wc command counts lines, words, and characters. With -l it returns the number of lines.

How to append one file to another in Linux from the shell?

I have two files: file1 and file2. How do I append the contents of file2 to file1 so that the existing contents of file1 are preserved?
Use bash's built-in redirection (tldp):
cat file2 >> file1
cat file2 >> file1
The >> operator appends the output to the named file or creates the named file if it does not exist.
cat file1 file2 > file3
This concatenates two or more files to one. You can have as many source files as you need. For example,
cat *.txt >> newfile.txt
Update 20130902
In the comments eumiro suggests "don't try cat file1 file2 > file1." The reason this might not result in the expected outcome is that the file receiving the redirect is prepared before the command to the left of the > is executed. In this case, first file1 is truncated to zero length and opened for output, then the cat command attempts to concatenate the now zero-length file plus the contents of file2 into file1. The result is that the original contents of file1 are lost and in its place is a copy of file2 which probably isn't what was expected.
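If the combined result really does need to end up back in file1, the usual workaround is to go through a temporary file; sponge from moreutils, where installed, does that buffering for you. A sketch:
tmp=$(mktemp) && cat file1 file2 > "$tmp" && mv "$tmp" file1
cat file1 file2 | sponge file1   # alternative; requires moreutils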
Update 20160919
In the comments tpartee suggests linking to backing information/sources. For an authoritative reference, I direct the kind reader to the sh man page at linuxcommand.org which states:
Before a command is executed, its input and output may be redirected
using a special notation interpreted by the shell.
While that does tell the reader what they need to know it is easy to miss if you aren't looking for it and parsing the statement word by word. The most important word here being 'before'. The redirection is completed (or fails) before the command is executed.
In the example case of cat file1 file2 > file1 the shell performs the redirection first so that the I/O handles are in place in the environment in which the command will be executed before it is executed.
A friendlier version in which the redirection precedence is covered at length can be found at Ian Allen's web site in the form of Linux courseware. His I/O Redirection Notes page has much to say on the topic, including the observation that redirection works even without a command. Passing this to the shell:
$ >out
...creates an empty file named out. The shell first sets up the I/O redirection, then looks for a command, finds none, and completes the operation.
Note: if you need to use sudo, do this:
sudo bash -c 'cat file2 >> file1'
The usual method of simply prepending sudo to the command will fail, since the privilege escalation doesn't carry over into the output redirection.
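An equivalent idiom keeps the append on the privileged side with sudo tee -a (the > /dev/null just hides tee's copy of the output):
cat file2 | sudo tee -a file1 > /dev/null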
Try this command:
cat file2 >> file1
Just for reference, using ddrescue provides an interruptible way of achieving the task if, for example, you have large files and the need to pause and then carry on at some later point:
ddrescue -o $(wc --bytes file1 | awk '{ print $1 }') file2 file1 logfile
The logfile is the important bit. You can interrupt the process with Ctrl-C and resume it by specifying the exact same command again and ddrescue will read logfile and resume from where it left off. The -o A flag tells ddrescue to start from byte A in the output file (file1). So wc --bytes file1 | awk '{ print $1 }' just extracts the size of file1 in bytes (you can just paste in the output from ls if you like).
As pointed out by ngks in the comments, the downside is that ddrescue will probably not be installed by default, so you will have to install it manually. The other complication is that there are two versions of ddrescue which might be in your repositories: see this askubuntu question for more info. The version you want is the GNU ddrescue, and on Debian-based systems is the package named gddrescue:
sudo apt install gddrescue
For other distros check your package management system for the GNU version of ddrescue.
Another solution:
tee < file1 -a file2
tee has the benefit that you can append to as many files as you like, for example:
tee < file1 -a file2 file3 file4
will append the contents of file1 to file2, file3, and file4.
From the man page:
-a, --append
append to the given FILEs, do not overwrite
Zsh specific: You can also do this without cat, though honestly cat is more readable:
>> file1 < file2
The >> appends STDIN to file1 and the < dumps file2 to STDIN.
cat may be the easy solution, but it becomes very slow when concatenating large files; find with -print0 and xargs can come to the rescue, though you still have to use cat once.
amey@xps ~/work/python/tmp $ ls -lhtr
total 969M
-rw-r--r-- 1 amey amey 485M May 24 23:54 bigFile2.txt
-rw-r--r-- 1 amey amey 485M May 24 23:55 bigFile1.txt
amey@xps ~/work/python/tmp $ time cat bigFile1.txt bigFile2.txt >> out.txt
real 0m3.084s
user 0m0.012s
sys 0m2.308s
amey@xps ~/work/python/tmp $ time find . -maxdepth 1 -type f -name 'bigFile*' -print0 | xargs -0 cat -- > outFile1
real 0m2.516s
user 0m0.028s
sys 0m2.204s
