How to accelerate substitution when using GNU sed with GNU find? - linux

I have the results of a numerical simulation that consist of hundreds of directories; each directory contains millions of text files.
I need to substitute the string "wavelength;" with "wavelength_bc;", so I have tried both of the following:
find . -type f -exec sed -i 's/wavelength;/wavelength_bc;/g' {} \;
and
find . -type f -exec sed -i 's/wavelength;/wavelength_bc;/g' {} +
Unfortunately, the commands above take a very long time to finish (more than an hour).
I wonder how can I take advantage of the number of cores on my machine (8) to accelerate the command above?
I am thinking of using xargs with the -P flag, but I'm scared that it will corrupt the files; I have no idea whether that is safe or not.
In summary:
How can I accelerate sed substitutions when used with find?
Is it safe to use xargs -P to run them in parallel?
Thank you

xargs -P should be safe to use; however, you will need find's -print0 option, piped to xargs -0, to handle filenames containing spaces or wildcard characters:
find . -type f -print0 |
xargs -0 -I {} -P 0 sed -i 's/wavelength;/wavelength_bc;/g' {}
The -P 0 option puts xargs into parallel mode: it runs as many processes at a time as possible.
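Since each sed writes to a different file, running them in parallel is safe. One refinement worth noting: -I {} forces one sed process per file, so on millions of files the fork/exec overhead dominates. A sketch (assuming GNU xargs and GNU sed) that batches many files per sed invocation instead, demonstrated on a throwaway directory:

```shell
# Throwaway demo tree; in real use, run the find/xargs pipeline in place.
tmp=$(mktemp -d)
printf 'wavelength;=500\n' > "$tmp/a.txt"
printf 'wavelength;=600\n' > "$tmp/b with space.txt"

# Batch up to 1000 files per sed process and run 8 sed processes at once.
find "$tmp" -type f -print0 |
  xargs -0 -n 1000 -P 8 sed -i 's/wavelength;/wavelength_bc;/g'

grep -rh 'wavelength_bc;' "$tmp"
```

The batch size of 1000 is arbitrary; anything large enough to amortize the process startup cost works.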

This might work for you (GNU sed & parallel):
find . -type f | parallel -q sed -i 's/wavelength;/wavelength_bc;/g' {}
GNU parallel will run as many jobs as there are cores on the machine in parallel.
More sophisticated uses can involve remote servers and file transfer see here and a cheatsheet here.

Related

How to grep through many files of same file type

I wish to grep through many (20,000) text files, each with about 1,000,000 lines, so the faster the better.
I have tried the code below and it just doesn't seem to do anything: it doesn't find any matches even after an hour (it should have finished by now).
for i in $(find . -name "*.txt"); do grep -Ff firstpart.txt $1; done
Ofir's answer is good. Another option:
find . -name "*.txt" -exec grep -nFH -f firstpart.txt {} \;
I like to add the -n for line numbers and -H to get the filename. -H is particularly useful in this case as you could have a lot of matches.
Instead of iterating through the files in a loop, you can just give the file names to grep using xargs and let grep go over all the files.
find . -name "*.txt" | xargs grep $1
I'm not quite sure whether it will actually increase the performance, but it's probably worth a try.
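A more robust shape for that idea, sketched here with illustrative paths: NUL-delimited names survive spaces, and the fixed-string pattern file goes to grep via -f rather than a shell variable:

```shell
tmp=$(mktemp -d)
printf 'needle\n' > "$tmp/patterns.txt"
printf 'hay\nneedle here\n' > "$tmp/data.txt"

# -F: fixed strings; -H/-n: show filename and line number for each match.
find "$tmp" -name '*.txt' ! -name 'patterns.txt' -print0 |
  xargs -0 grep -FHn -f "$tmp/patterns.txt"
```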
ripgrep is the most amazing tool. You should get that and use it.
To search *.txt files in all directories recursively, do this:
rg -t txt -f patterns.txt
Ripgrep uses one of the fastest regular expression engines out there. It uses multiple threads. It searches directories and files, and filters them to the interesting ones in the fastest way.
It is simply great.
For anyone stuck using grep for whatever reason:
find -name '*.txt' -type f -print0 | xargs -0 -P 8 -n 8 grep -Ff patterns.txt
That tells xargs to use 8 arguments per command (-n 8) and to run 8 copies in parallel (-P 8). The downside is that output from different copies may become interleaved and garbled.
Instead of xargs you could use parallel which does a fancier job and keeps output in order:
find -name '*.txt' -type f -print0 | parallel -0 grep --with-filename -Ff patterns.txt

Linux Shell Command: Find. How to Sort and Exec without using Pipes?

Linux command find with argument exec does a GREAT job executing commands on files/folders regardless whether they contain spaces and special characters. For example:
find . -type f -exec md5sum {} \;
Works great for running md5sum on each file in a directory tree, but it executes in an arbitrary order. find does not sort its results, so getting a more human-readable ordering requires piping to sort, which eliminates the benefits of -exec.
This does not work:
find . -type f | sort | md5sum
Because some filenames contain spaces and special characters.
Also does not work:
find . -type f | sort | sed 's/ /\\ /g' | md5sum
It still does not treat the spaces as part of the filename.
I suppose I can always sort the final result later, but wonder if someone knows an easy way to avoid that extra step by sorting within find?
With BSD find
A -s argument is available to request lexicographic sort order.
find . -s -type f -exec md5sum -- '{}' +
With GNU find
Use NUL delimiters to allow filenames to be processed unambiguously. Assuming you have GNU tools:
find . -type f -print0 | sort -z | xargs -0 md5sum
Found a working solution
find . -type f -exec md5sum {} + | sort -k 1.33
This sorts the results by comparing the characters after the 32-character md5sum digest, producing a readable, sorted list.
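The 1.33 offset works because the hex digest is always exactly 32 characters, so the sort key effectively starts at the separator before the filename. A quick sketch with two throwaway files:

```shell
tmp=$(mktemp -d)
printf 'b\n' > "$tmp/zzz"
printf 'a\n' > "$tmp/aaa"

# Sorted by pathname, not by digest, because the key skips the first 32 characters.
find "$tmp" -type f -exec md5sum {} + | sort -k 1.33
```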

Optimal string replacing in files for AIX

I need to remove about 40 emails from several files in a distribution list.
One Address might appear in different files and need to be removed from all of them.
I am working in a directory with several .sh files which also have several lines.
I have done something like this in a couple of test files:
find . -type f -exec grep -li ADDRESS_TO_FIND {} 2>/dev/null \; | xargs sed -i 's/ADDRESS_TO_REMOVE/ /g' *
It works fine but once I try it in the real files, it takes a long time and just sits there. I need to run this in different servers, this is the main cause I want to optimize this.
I have tried to run something like this:
find . -type f -name '*sh' 2>/dev/null | xargs grep ADDRESS_TO_FIND
but that will return:
./FileContainingAddress.sh:ADDRESS_TO_FIND
How do I add something like this:
awk '{print substr($0,1,10)}'
but have it return everything before the ":"?
I can do the rest from there, but I haven't found out how to trim that part.
You can use -exec as a predicate in find, as long as you don't use the multiple-file + form. That means you can chain several -exec clauses, each of which runs only if the previous one succeeded. This style avoids building lists of filenames, which makes it much more robust for files with odd characters in their names.
For example:
find . -type f -name '*sh' \
-exec grep -qi ADDRESS_TO_FIND {} \; \
-exec sed -i 's/ADDRESS_TO_FIND/ /g' {} \;
You probably want to provide the address as a parameter rather than having to type it twice, unless you really meant for the two instances to be different (ADDRESS_TO_FIND vs. ADDRESS_TO_REMOVE):
clean() {
find . -type f -name '*sh' \
-exec grep -qi "$1" {} \; \
-exec sed -i "s/$1/ /g" {} \;
}
(Watch out for / in the argument to clean. I'll leave making the sed more robust as an exercise.)
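One way to harden the sed side is to escape the characters sed's pattern treats specially, together with the chosen delimiter; esc below is a hypothetical helper, and | is picked as the delimiter since / is common in paths but rare in email addresses:

```shell
# Escape BRE metacharacters and the | delimiter in a literal search string.
esc() { printf '%s' "$1" | sed 's/[][\.*^$|]/\\&/g'; }

tmp=$(mktemp -d)
printf 'user.name@example.com in list\n' > "$tmp/dist.sh"

# The dots in the address are now matched literally, not as "any character".
pat=$(esc 'user.name@example.com')
sed -i "s|$pat| |g" "$tmp/dist.sh"
```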
After looking back at your question, I noticed something that's potentially quite important:
find -type f -exec grep -li ADDRESS {} \; | xargs sed -i 's/ADDRESS/ /g' *
# here! -----------------------------------------------------------------^
The asterisk is being expanded, so the sed line is operating on every file in the directory.
Assuming that this wasn't a typo in your question, I believe that this is the source of your poor performance. You should remove it!
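With the stray asterisk dropped, the same grep-then-sed idea can be sketched with NUL delimiters so odd filenames survive: GNU grep's -Z prints NUL-terminated names with -l, and GNU xargs' -r skips sed entirely when nothing matched.

```shell
tmp=$(mktemp -d)
printf 'old@example.com here\n' > "$tmp/a.sh"
printf 'untouched\n'            > "$tmp/b.sh"

# Only files that actually contain the address are handed to sed.
find "$tmp" -type f -name '*.sh' -exec grep -liZ 'old@example.com' {} + |
  xargs -0 -r sed -i 's/old@example\.com/ /g'
```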

Fastest way to grep through thousands of gz files?

I have thousands of .gz files all in one directory. I need to grep through them for the string Mouse::Handler, is the following the fastest (and most accurate) way to do this?
find . -name "*.gz" -exec zgrep -H 'Mouse::Handler' {} \;
Ideally I would also like to print out the line that I find this string on.
I'm running on a RHEL linux box.
You can search in parallel using
find . -name "*.gz" | xargs -n 1 -P NUM zgrep -H 'Mouse::Handler'
where NUM is around the number of cores you have.
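The core count doesn't have to be hard-coded; on GNU/Linux, nproc can supply it (a sketch, assuming GNU xargs and the zgrep that ships with gzip):

```shell
tmp=$(mktemp -d)
printf 'Mouse::Handler->new\n' | gzip > "$tmp/one.gz"
printf 'nothing here\n'        | gzip > "$tmp/two.gz"

# One zgrep per file, with as many running at once as there are cores.
find "$tmp" -name '*.gz' -print0 |
  xargs -0 -n 1 -P "$(nproc)" zgrep -H 'Mouse::Handler'
```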

How do I include a pipe | in my linux find -exec command?

This isn't working. Can this be done in find? Or do I need to xargs?
find -name 'file_*' -follow -type f -exec zcat {} \| agrep -dEOE 'grep' \;
the solution is easy: execute via sh
... -exec sh -c "zcat {} | agrep -dEOE 'grep' " \;
The job of interpreting the pipe symbol as an instruction to run multiple processes and pipe the output of one process into the input of another process is the responsibility of the shell (/bin/sh or equivalent).
In your example you can either choose to use your top level shell to perform the piping like so:
find -name 'file_*' -follow -type f -exec zcat {} \; | agrep -dEOE 'grep'
In terms of efficiency, this costs one invocation of find, numerous invocations of zcat, and one invocation of agrep.
This would result in only a single agrep process being spawned which would process all the output produced by numerous invocations of zcat.
If you for some reason would like to invoke agrep multiple times, you can do:
find . -name 'file_*' -follow -type f \
-printf "zcat %p | agrep -dEOE 'grep'\n" | sh
This constructs a list of commands using pipes to execute, then sends these to a new shell to actually be executed. (Omitting the final "| sh" is a nice way to debug or perform dry runs of command lines like this.)
In terms of efficiency, this costs one invocation of find, one invocation of sh, numerous invocations of zcat, and numerous invocations of agrep.
The most efficient solution in terms of number of command invocations is the suggestion from Paul Tomblin:
find . -name "file_*" -follow -type f -print0 | xargs -0 zcat | agrep -dEOE 'grep'
... which costs one invocation of find, one invocation of xargs, a few invocations of zcat and one invocation of agrep.
You can also pipe to a while loop that can do multiple actions on the file which find locates. So here is one for looking in jar archives for a given java class file in folder with a large distro of jar files
find /usr/lib/eclipse/plugins -type f -name \*.jar | while IFS= read -r jar; do echo "$jar"; jar tf "$jar" | fgrep IObservableList; done
The key point is that the while loop contains multiple commands referencing the passed-in file name, separated by semicolons, and these commands can include pipes. In this example I echo the name of the matching file, then list the archive contents while filtering for a given class name. The output looks like:
/usr/lib/eclipse/plugins/org.eclipse.core.contenttype.source_3.4.1.R35x_v20090826-0451.jar
/usr/lib/eclipse/plugins/org.eclipse.core.databinding.observable_1.2.0.M20090902-0800.jar
org/eclipse/core/databinding/observable/list/IObservableList.class
/usr/lib/eclipse/plugins/org.eclipse.search.source_3.5.1.r351_v20090708-0800.jar
/usr/lib/eclipse/plugins/org.eclipse.jdt.apt.core.source_3.3.202.R35x_v20091130-2300.jar
/usr/lib/eclipse/plugins/org.eclipse.cvs.source_1.0.400.v201002111343.jar
/usr/lib/eclipse/plugins/org.eclipse.help.appserver_3.1.400.v20090429_1800.jar
In my bash shell (xubuntu 10.04/xfce) the matched class name really does appear in bold, since fgrep highlights the matched string; this makes it easy to scan down the list of hundreds of searched jar files and spot any matches.
on windows you can do the same thing with:
for /R %j in (*.jar) do #echo %j & #jar tf %j | findstr IObservableList
Note that on Windows the command separator is '&', not ';', and the '#' suppresses echoing of the command to give tidy output, just like the Linux find output above; findstr, however, does not make the matched string bold, so you have to look a bit more closely at the output to see the matched class name. It turns out that the Windows 'for' command knows quite a few tricks, such as looping through text files...
enjoy
I found that running a string shell command (sh -c) works best, for example:
find -name 'file_*' -follow -type f -exec bash -c "zcat \"{}\" | agrep -dEOE 'grep'" \;
If you are looking for a simple alternative, this can be done using a loop:
for i in $(find -name 'file_*' -follow -type f); do
zcat $i | agrep -dEOE 'grep'
done
or, in a more general and easier-to-understand form:
for i in $(YOUR_FIND_COMMAND); do
YOUR_EXEC_COMMAND_AND_PIPES
done
and replace any {} by $i in YOUR_EXEC_COMMAND_AND_PIPES
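One caveat with the for i in $(find ...) form: word splitting breaks it on filenames containing spaces. A safer general shape, assuming bash and GNU find, is a NUL-delimited while-read loop (shown here with grep standing in for agrep):

```shell
tmp=$(mktemp -d)
printf 'hello\n' | gzip > "$tmp/file_one two.gz"

# read -d '' consumes NUL-delimited names, so spaces in names are safe.
find "$tmp" -name 'file_*' -type f -print0 |
  while IFS= read -r -d '' f; do
    zcat "$f" | grep -c 'hello'
  done
```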
Here's what you should do:
find -name 'file_*' -follow -type f -exec sh -c 'zcat "$1" | agrep -dEOE "grep"' sh {} \;
I tried a couple of these answers and they didn't work for me. @flolo's answer doesn't work correctly if your filenames contain special characters. According to this answer:
The find command executes the command directly. The command, including the filename argument, will not be processed by the shell or anything else that might modify the filename. It's very safe.
You lose that safety if you put the {} inside the sh command string.
There is a potential problem with @Rolf W. Rasmussen's answer. Yes, it handles special characters (as far as I know), but if the find output is too long, you won't be able to execute xargs -0 ...: there is a command-line length limit set by the kernel and sometimes your shell. Coincidentally, every time I want to pipe commands from a find, I run into this limit.
But they do raise a valid point about the performance limitations. I'm not sure how to overcome that, though personally I've never run into a situation where my suggestion is too slow.
