Ubuntu terminal - using GNU parallel to read lines in all files in a folder - multithreading

I am trying to count the lines in all the files in a very large folder under Ubuntu.
The files are .gz files and I use
zcat * | wc -l
to count all the lines in all the files, and it's slow!
I want to use multi-core computing for this task and found this
about GNU Parallel.
I tried to use this bash command:
parallel zcat * | parallel --pipe wc -l
and the cores are not all working.
I found that job startup might cause major overhead and tried using batching with
parallel -X zcat * | parallel --pipe -X wc -l
without improvement.
How can I use all the cores to count the lines in all the files in a folder, given that they are all .gz files and need to be decompressed before counting the rows (I don't need to keep them uncompressed afterwards)?
Thanks!

If you have 150,000 files, you will likely get problems with "argument list too long". You can avoid that like this:
find . -maxdepth 1 -name \*gz -print0 | parallel -0 ...
If you want the name beside the line count, you will have to echo it yourself, since your wc process will only be reading from its stdin and won't know the filename:
find ... | parallel -0 'echo {} $(zcat {} | wc -l)'
Next, we come to efficiency and it will depend on what your disks are capable of. Maybe try with parallel -j2 then parallel -j4 and see what works on your system.
As Ole helpfully points out in the comments, you can avoid having to echo the filename yourself by using GNU Parallel's --tag option, which tags each output line with its input, so this is even more efficient:
find ... | parallel -0 --tag 'zcat {} | wc -l'
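If, as in the original question, you only need a grand total rather than per-file counts, a minimal sketch (assuming GNU Parallel and awk are available) is to drop --tag and sum the per-file counts:
# per-file counts from parallel, summed into one number
find . -maxdepth 1 -name \*gz -print0 |
  parallel -0 'zcat {} | wc -l' |
  awk '{total += $1} END {print total}'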

Basically the command you are looking for is:
ls *gz | parallel 'zcat {} | wc -l'
What it does is:
ls *gz: list all gz files on stdout
Pipe it to parallel
Spawn subshells with parallel
Run in said subshells the command inside quotes 'zcat {} | wc -l'
About the '{}', according to the manual:
This replacement string will be replaced by a full line read from the input source
So each line piped to parallel gets fed to zcat.
Of course this is basic; I assume it could be tuned, and the documentation and examples might help.
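For example, one hedged tuning sketch (the explicit job count and --tag are experiments, not part of the original answer) is to pin the number of jobs and label each count with its filename:
# one job per CPU core, shown explicitly; try -j 2, -j 4, ... to match what your disks can deliver
ls *gz | parallel -j "$(nproc)" --tag 'zcat {} | wc -l'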

Related

Total number of lines in a directory

I have a directory with thousands of files (100K for now). When I use wc -l ./*, I'll get:
c1 ./test1.txt
c2 ./test2.txt
...
cn ./testn.txt
c1+c2+...+cn total
Because there are a lot of files in the directory, I just want to see the total count and not the details. Is there any way to do so?
I tried several ways and I got the following error:
Argument list too long
If what you want is the total number of lines and nothing else, then I would suggest the following command:
cat * | wc -l
This concatenates the contents of all of the files in the current working directory and pipes the resulting blob of text through wc -l.
I find this to be quite elegant. Note that the command produces no extraneous output.
UPDATE:
I didn't realize your directory contained so many files. In light of this information, you should try this command:
for file in *; do cat "$file"; done | wc -l
Most people don't know that you can pipe the output of a for loop directly into another command.
Beware that this could be very slow. If you have 100,000 or so files, my guess would be around 10 minutes. This is a wild guess because it depends on several parameters that I'm not able to check.
If you need something faster, you should write your own utility in C. You could make it surprisingly fast if you use pthreads.
Hope that helps.
LAST NOTE:
If you're interested in building a custom utility, I could help you code one up. It would be a good exercise, and others might find it useful.
Credit: this builds on @lifecrisis's answer, and extends it to handle large numbers of files:
find . -maxdepth 1 -type f -exec cat {} + | wc -l
find will find all of the files in the current directory, break them into groups as large as can be passed as arguments, and run cat on the groups.
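An equivalent sketch with xargs (assuming GNU find and xargs), which is also safe for filenames containing spaces or newlines:
# NUL-delimited names, grouped into as few cat invocations as the argument-length limit allows
find . -maxdepth 1 -type f -print0 | xargs -0 cat | wc -l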
awk 'END {print NR" total"}' ./*
This would be an interesting comparison, to find out how many lines don't end with a newline.
Combining the awk and Gordon’s find solutions and avoiding the "." files.
find ./* -maxdepth 0 -type f -exec awk 'END {print NR}' {} +
No idea if this is better or worse, but it does give a more accurate count (for me) and does not count lines in "." files. Using ./* is just a guess that appears to work.
You still need -maxdepth, and ./* requires a depth of 0.
I did get the same result with the "cat" and "awk" solutions (using the same find), since "cat *" takes care of the newline issue. I don't have a directory with enough files to measure time. Interesting; I'm liking the "cat" solution.
This will give you the total count for all the files (including hidden files) in your current directory :
$ find . -maxdepth 1 -type f | xargs wc -l | grep total
1052 total
To count lines in files excluding hidden files, use:
find . -maxdepth 1 -type f -not -path "*/\.*" | xargs wc -l | grep total
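Note that if the file list is long enough for xargs to split it into several batches, wc prints one "total" line per batch and grep total returns several numbers. A sketch (assuming GNU find, xargs, and awk) that sums the per-file counts itself avoids that:
# sum the per-file counts ourselves so multiple xargs batches don't matter
find . -maxdepth 1 -type f -print0 |
  xargs -0 wc -l |
  awk '$2 != "total" {sum += $1} END {print sum}'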
(Apologies for adding this as an answer—but I do not have enough reputation for commenting.)
A comment on @lifecrisis's answer. Perhaps cat is slowing things down a bit. We could replace cat with wc -l and then use awk to add the numbers. (This could be faster since much less data needs to go through the pipe.)
That is
for file in *; do wc -l "$file"; done | awk '{sum += $1} END {print sum}'
instead of
for file in *; do cat "$file"; done | wc -l
(Disclaimer: I am not incorporating many of the improvements in other answers, but I thought the point was valid enough to write down.)
Here are my results for comparison (I ran the newer version first so that any cache effects would go against the newer candidate).
$ time for f in `seq 1 1500`; do head -c 5M </dev/urandom >myfile-$f |sed -e 's/\(................\)/\1\n/g'; done
real 0m50.360s
user 0m4.040s
sys 0m49.489s
$ time for file in myfile-*; do wc -l "$file"; done | awk '{sum += $1} END {print sum}'
30714902
real 0m3.455s
user 0m2.093s
sys 0m1.515s
$ time for file in myfile-*; do cat "$file"; done | wc -l
30714902
real 0m4.481s
user 0m2.544s
sys 0m4.312s
If you want to know only the number of entries in the directory listing, excluding the "total" header line that ls -l prints:
ls -ltr | sed -n '/total/!p' | awk 'END {print NR}'
That gives a count of files in the directory, not of the lines inside them. The command below will provide the total count of lines from all files in the path:
for i in $(ls -ltr | awk '$1 ~ /^-rw/ {print $9}'); do wc -l "$i" | awk '{print $1}'; done >> /var/tmp/filelinescount.txt
cat /var/tmp/filelinescount.txt | sed -r 's/\s+//g' | tr '\n' '+' | sed 's/+$//' | awk '{print "echo " $0 " | bc"}' | sh
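A much simpler sketch of the same two results (file count, then total line count) that avoids parsing ls output, assuming GNU find:
# number of regular files in the current directory
find . -maxdepth 1 -type f | wc -l
# total lines across all of those files
find . -maxdepth 1 -type f -exec cat {} + | wc -l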

How do I use the pipe command to display attributes in a file?

I'm currently making a shell program and I want to display the total number of bytes in a specific file using a pipe. I know that the pipe takes whatever is on the left side and gives it to the right as input. (Assuming you are in the directory the file is in.)
I know that the command (wc -c) displays the number of bytes in a file but I'm not sure how to pipe it. What I've tried was:
ls fileName.sh | wc -c
wc takes the filename as an argument, not as input; piping the output of ls to it only counts the bytes of the listing itself. Try this:
wc -c fileName.sh
The wc program takes multiple arguments. You can do this to apply it to all entries in the current working directory:
wc -c $(ls)
Another approach is to use xargs to convert input to arguments:
ls | xargs wc -c
You may need to use a more complex line if you have spaces in your filenames. ls -1 outputs a single file per line, and xargs can be told to split only on \n:
ls -1 | xargs -d '\n' wc -c
If you prefer to use find instead of ls (a more powerful tool), the -print0 option for find plays along with the -0 option to xargs.
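A minimal sketch of that find/xargs combination (assuming GNU find and xargs), which stays correct even for filenames containing spaces or newlines:
# NUL-delimited filenames are passed safely to wc
find . -maxdepth 1 -type f -print0 | xargs -0 wc -c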

Can find push the filenames of the found files into the pipe?

I would like to do a find in some dir, run awk on the files in this directory, and then replace the original files with each result.
find dir | xargs cat | awk ... | mv ... > filename
So I need the filename (of each of the files found by find) in the last command. How can I do that?
I would use a loop, like:
for filename in `find . -name "*test_file*" -print0 | xargs -0`
do
    # some processing, then
    echo "what you like" > "$filename"
done
EDIT: as noted in the comments, the benefits of -print0 | xargs -0 are lost because of the for loop, and filenames containing whitespace are still not handled correctly.
The following while loop would not handle unusual filenames either (good to know, though it was not in the question), but it does at least handle filenames with ordinary spaces, so it works better:
find . -name "*test*file*" -print > files_list
while IFS= read -r filename
do
# some process
echo "what you like" > "$filename"
done < files_list
You could do something like this (but I wouldn't recommend it at all).
find dir -print0 |
xargs -0 -n 2 awk -v OFS='\0' '<process the input and write to temporary file>
END {print "temporaryfile", FILENAME}' |
xargs -0 -n 2 mv
This passes the files to awk directly, two at a time (which avoids the problem with your original, where cat would get hundreds or more files as arguments all at once and spit all their content at awk via standard input, losing their individual contents and filenames entirely).
It then has awk write the processed output to a temporary file and then outputs the temporary filename and the original filename where xargs picks them up (again two at a time) and runs mv on the pairs of temporary file/original file names.
As I said at the beginning however this is a terrible way to do this.
If you have a new enough version of GNU awk (version 4.1.0 or newer) then you could just use its in-place editing extension (loaded with -i inplace) and use something like:
find dir | xargs gawk -i inplace '......'
Without that I would use a while loop of the form in Bash FAQ 001 to read the find output line-by-line and operate on it in the loop.
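A minimal sketch of that while-read loop (the awk program and the temporary-file handling are placeholders and assumptions, not from the original answer):
find dir -type f -print0 |
while IFS= read -r -d '' filename; do
    tmp=$(mktemp)                                   # temporary output for this file
    awk '...your program...' "$filename" > "$tmp" && mv "$tmp" "$filename"   # replace the original only on success
done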

Optimizing search in linux

I have a huge log file close to 3GB in size.
My task is to generate some reporting based on # of times something is being logged.
I need to find the number of times StringA, StringB, and StringC are each logged.
What I am doing right now is:
grep "StringA" server.log | wc -l
grep "StringB" server.log | wc -l
grep "StringC" server.log | wc -l
This is a long process and my script takes close to 10 minutes to complete. What I want to know is whether this can be optimized or not. Is it possible to run one grep command and find out the number of times StringA, StringB, and StringC have each been called?
You can use grep -c instead of wc -l:
grep -c "StringA" server.log
A single grep can't report counts of individual strings. You can use awk:
out=$(awk '/StringA/{a++;} /StringB/{b++;} /StringC/{c++;} END{print a, b, c}' server.log)
Then you can extract each count with a simple bash array:
arr=($out)
echo "StringA="${arr[0]}
echo "StringA="${arr[1]}
echo "StringA="${arr[2]}
This (grep -c without wc) is certainly going to be faster, and the awk solution is possibly faster still, but I haven't measured either.
Certainly this approach could be optimized since grep doesn't perform any text indexing. I would use a text indexing engine like one of those from this review or this stackexchange QA . Also you may consider using journald from systemd which stores logs in a structured and indexed format so lookups are more effective.
So many greps so little time... :-)
According to David Lyness, a straight grep search is about 7 times as fast as an awk in large file searches.
If that is the case, the current approach could be optimized by changing grep to fgrep, but only if the patterns being searched for are not regular expressions. fgrep is optimized for fixed patterns.
If the number of instances is relatively small compared to the original log file entries, it may be an improvement to use the egrep version of grep to create a temporary file filled with all three instances:
egrep "StringA|StringB|StringC" server.log > tmp.log
grep "StringA" tmp.log | wc -c
grep "StringB" tmp.log | wc -c
grep "StringC" tmp.log | wc -c
The egrep variant of grep allows a | (vertical bar/pipe) character between two or more separate search strings, so that you can find multiple strings in a single statement. You can use grep -E to do the same thing.
Full documentation is in the grep man page, and information about the Extended Regular Expressions that egrep uses is in man 7 re_format.
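If the three strings are plain fixed patterns, a one-pass sketch (an alternative not in the original answer) can count each of them separately without a temporary file:
# print every match on its own line, then count how many times each string occurred
grep -oF -e 'StringA' -e 'StringB' -e 'StringC' server.log | sort | uniq -c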

How do I grep in parallel

I usually use grep -rIn pattern_str big_source_code_dir to find something, but grep is not parallel. How do I make it parallel? My system has 4 cores; if grep could use all the cores, it would be faster.
There will be no speed improvement if you are using an HDD to store the directory you are searching in. Hard drives are pretty much single-threaded access units.
But if you really want to do parallel grep, then this website gives two hints of how to do it with find and xargs. E.g.
find . -type f -print0 | xargs -0 -P 4 -n 40 grep -i foobar
The GNU parallel command is really useful for this.
sudo apt-get install parallel # if not available on debian based systems
Then, the parallel man page provides an example:
EXAMPLE: Parallel grep
grep -r greps recursively through directories.
On multicore CPUs GNU parallel can often speed this up.
find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}
This will run 1.5 job per core, and give 1000 arguments to grep.
In your case it could be:
find big_source_code_dir -type f | parallel -k -j150% -n 1000 -m grep -H -n pattern_str {}
Finally, the GNU parallel man page also provides a section describing the differences between xargs and the parallel command, which should help you understand why parallel seems better in your case:
DIFFERENCES BETWEEN xargs AND GNU Parallel
xargs offer some of the same possibilities as GNU parallel.
xargs deals badly with special characters (such as space, ' and "). To see the problem try this:
touch important_file
touch 'not important_file'
ls not* | xargs rm
mkdir -p "My brother's 12\" records"
ls | xargs rmdir
You can specify -0 or -d "\n", but many input generators are not optimized for using NUL as separator but are optimized for newline as separator. E.g head, tail, awk, ls, echo, sed, tar -v, perl (-0 and \0 instead of \n),
locate (requires using -0), find (requires using -print0), grep (requires user to use -z or -Z), sort (requires using -z).
So GNU parallel's newline separation can be emulated with:
cat | xargs -d "\n" -n1 command
xargs can run a given number of jobs in parallel, but has no support for running number-of-cpu-cores jobs in parallel.
xargs has no support for grouping the output, therefore output may run together, e.g. the first half of a line is from one process and the last half of the line is from another process. The example Parallel grep cannot be
done reliably with xargs because of this.
...
Note that you need to escape special characters in your parallel grep search term, for example:
parallel --pipe --block 10M --ungroup LC_ALL=C grep -F 'PostTypeId=\"1\"' < ~/Downloads/Posts.xml > questions.xml
Using standalone grep, grep -F 'PostTypeId="1"' would work without escaping the double quotes. It took me a while to figure that out!
Also note the use of LC_ALL=C and the -F flag (if you're just searching full strings) for additional speed-ups.
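Since Posts.xml is a regular file on disk, one further hedged variant (assuming a GNU Parallel version with --pipepart, which the next answer also uses) avoids the single reader that --pipe implies:
# --pipepart lets each job read its own block of the file directly
parallel --pipepart -a ~/Downloads/Posts.xml --block 10M --ungroup \
    LC_ALL=C grep -F 'PostTypeId=\"1\"' > questions.xml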
Here are 3 ways to do it, but you can't get line number for two of them.
(1) Run grep on multiple files in parallel, in this case all files in a directory and its subdirectories. Add /dev/null to force grep to prepend the filename to the matching line, because you'll want to know which file matched. Adjust the number of processes (-P) for your machine.
find . -type f | xargs -n 1 -P 4 grep -n <grep-args> /dev/null
(2) Run grep on multiple files in serial but process 10M blocks in parallel. Adjust the block size for your machine and files. Here are two ways to do that.
# for-loop
for filename in `find . -type f`
do
parallel --pipepart --block 10M -a $filename -k "grep <grep-args> | awk -v OFS=: '{print \"$filename\",\$0}'"
done
# using xargs
find . -type f | xargs -I filename parallel --pipepart --block 10M -a filename -k "grep <grep-args> | awk -v OFS=: '{print \"filename\",\$0}'"
(3) Combine (1) and (2): run grep on multiple files in parallel and process their contents in blocks in parallel. Adjust block size and xargs parallelism for your machine.
find . -type f | xargs -n 1 -P 4 -I filename parallel --pipepart --block 10M -a filename -k "grep <grep-args> | awk -v OFS=: '{print \"filename\",\$0}'"
Beware that (3) may not be the best use of resources.
I've got a longer write-up, but that's the basic idea.
