How to make many edits to files without writing to the hard drive very often, in BASH? - linux

I often need to make many edits to text files. The files are typically 20 MB in size and require ~500,000 individual edits, all of which must be made in a very specific order. Here is a simple example of a script I might need to use:
while read -r line
do
...
(20-100 lines of BASH commands preparing $a and $b)
...
sed -i "s/$a/$b/g" ./editfile.txt
...
done < ./readfile.txt
As many other lines of code appear before and after the sed script, it seems the only option for editing the file is sed with the -i option. Many have warned me against using sed -i, as that makes too many writes to the file. Recently, I had to replace two computers, as the hard drives stopped working after running the scripts. I need to find a solution that does not damage my computer's hardware.
Is there some way to send the file somewhere else, such as storing the whole file in a BASH variable, or in RAM, where sed, grep, and awk can make the edits without making millions of writes to the hard drive?

Don't use sed -i once per transform. A far better approach, one that leaves you with more control, is to construct a pipeline (if you can't use a single sed with multiple -e arguments to perform multiple operations within a single instance), and redirect to or from disk only at the beginning and end.
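For illustration, a minimal sketch of that single-pass idea (the $a1/$b1 and $a2/$b2 variables are placeholders, not part of your original script): the file is read from disk once and written once, no matter how many stages sit in between.
sed -e "s/$a1/$b1/g" <editfile.txt | sed -e "s/$a2/$b2/g" >editfile.txt.new
mv editfile.txt.new editfile.txt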
This can even be done recursively, if you use a FD other than stdin for reading from your file:
editstep() {
  read -u 3 -r                 # read a line from readfile (FD 3) into REPLY
  if [[ $REPLY ]]; then        # we read something new from readfile
    sed ... | editstep         # perform the edits, then a recursive call!
  else
    cat                        # readfile is exhausted; pass the stream through unchanged
  fi
}
editstep <editfile.txt >editfile.txt.new 3<readfile.txt
Better than that, though, is to consolidate to a single sed instance.
sed_args=( )
while read -r line; do
  sed_args+=( -e "s/in/out/" )
done <readfile.txt
sed -i "${sed_args[@]}" editfile.txt
...or, for edit lists too long to pass in on the command line:
sed_args=( )
while read -r line; do
  sed_args+=( "s/in/out/" )
done <readfile.txt
sed -i -f <(printf '%s\n' "${sed_args[@]}") editfile.txt
(Please don't read the above as an endorsement of sed -i, which is a non-POSIX extension and has its own set of problems; the POSIX-specified editor intended for in-place rather than streaming operations is ex, not sed).
Even better? Don't use sed at all, but keep all the operations inline in native bash.
Consider the following:
content=$(<editfile.txt)
while IFS= read -r; do
  # put your own logic here to set `in` and `out`
  content=${content//$in/$out}
done <readfile.txt
printf '%s\n' "$content" >editfile.new
One important caveat: This approach treats in as a literal string, not a regular expression. Depending on the edits you're actually making, this may actually improve correctness over the original code... but in any event, it's worth being aware of.
Another caveat: Reading the file's contents into a bash string is not necessarily a lossless operation; expect content to be truncated at the first NUL byte (if any exist), and a trailing newline to be added at the end of the file if none existed before.
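For a concrete sketch, suppose (purely as an assumption about your data format) that each line of readfile.txt holds a tab-separated search/replace pair; then the loop above could look like this:
content=$(<editfile.txt)
while IFS=$'\t' read -r in out; do
  # literal (non-regex) replacement of every occurrence of $in with $out
  content=${content//"$in"/$out}
done <readfile.txt
printf '%s\n' "$content" >editfile.new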

Simple ...
Instead of trying too many tricks, you can simply copy all your files and directories to /dev/shm.
This is a RAM-backed filesystem, so the edits never touch the physical disk. When you are done editing, copy everything back to the original destination. Do not forget to run sync after you are done :-)
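A hedged sketch of that workflow (using the paths from the question; adapt as needed):
cp ./editfile.txt /dev/shm/editfile.txt
# ... run the existing editing loop against /dev/shm/editfile.txt instead of ./editfile.txt ...
cp /dev/shm/editfile.txt ./editfile.txt
sync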

Related

Improve performance of Bash loop that removes windows line endings

Editor's note: This question was always about loop performance, but the original title led some answerers - and voters - to believe it was about how to remove Windows line endings.
The bash loop below just removes the Windows line endings and converts them to Unix, and it appears to be running, but it is slow. The input files are small (4 files ranging from 167 bytes to 1 KB), all with the same structure (a list of names); the only thing that varies is the length (i.e. some files have 10 names, others 50). Is it supposed to take over 15 minutes to complete this task on a Xeon processor? Thank you :)
for f in /home/cmccabe/Desktop/files/*.txt ; do
bname=`basename $f`
pref=${bname%%.txt}
sed 's/\r//' $f - $f > /home/cmccabe/Desktop/files/${pref}_unix.txt
done
Input .txt files
AP3B1
BRCA2
BRIP1
CBL
CTC1
EDIT
This is not a duplicate, as I was really asking why my bash loop that uses sed to remove Windows line endings was running so slowly. I did not mean to ask how to remove them; I was asking for ideas that might speed up the loop, and I got many. Thank you :). I hope this helps.
Use the utilities dos2unix and unix2dos to convert between Unix and Windows style line endings.
Your 'sed' command looks wrong. I believe the trailing $f - $f should simply be $f; the stray - makes sed also wait for input on standard input, which is why the script appears to hang. Running your script as written hangs for a very long time on my system, but making this change causes it to complete almost instantly.
Of course, the best answer is to use dos2unix, which was designed to handle this exact thing:
cd /home/cmccabe/Desktop/files
for f in *.txt ; do
  pref=$(basename -s '.txt' "$f")
  dos2unix -q -n "$f" "${pref}_unix.txt"
done
This always works for me:
perl -pe 's/\r\n/\n/' inputfile.txt > outputfile.txt
you can use dos2unix as stated before or use this small sed:
sed 's/\r//' file
The key to performance in Bash is to avoid loops in general, and in particular those that call one or more external utilities in each iteration.
Here is a solution that uses a single GNU awk command:
awk -v RS='\r\n' '
BEGINFILE { outFile=gensub("\\.txt$", "_unix&", 1, FILENAME) }
{ print > outFile }
' /home/cmccabe/Desktop/files/*.txt
-v RS='\r\n' sets CRLF as the input record separator, and by virtue of leaving ORS, the output record separator, at its default of \n, simply printing each input line terminates it with \n.
the BEGINFILE block is executed every time processing of a new input file starts; in it, gensub() is used to insert _unix before the .txt suffix of the input file at hand to form the output filename.
{print > outFile} simply prints the \n-terminated lines to the output file at hand.
Note that use of a multi-char. RS value, the BEGINFILE block, and the gensub() function are GNU extensions to the POSIX standard.
Switching from the OP's sed solution to a GNU awk-based one was necessary in order to provide a single-command solution that is both simpler and faster.
Alternatively, here's a solution that relies on dos2unix for conversion of Windows line endings (for instance, you can install dos2unix with sudo apt-get install dos2unix on Debian-based systems); except for requiring dos2unix, it should work on most platforms (no GNU utilities required):
It uses a loop only to construct the array of filename arguments to pass to dos2unix - this should be fast, given that no call to basename is involved; Bash-native parameter expansion is used instead.
It then uses a single invocation of dos2unix to process all files.
# cd to the target folder, so that the operations below do not need to handle
# path components.
cd '/home/cmccabe/Desktop/files'
# Collect all *.txt filenames in an array.
inFiles=( *.txt )
# Derive output filenames from it, using Bash parameter expansion:
# '%.txt' matches '.txt' at the end of each array element, and replaces it
# with '_unix.txt', effectively inserting '_unix' before the suffix.
outFiles=( "${inFiles[#]/%.txt/_unix.txt}" )
# Create an interleaved array of *input-output filename pairs* to be passed
# to dos2unix later.
# To inspect the resulting array, run `printf '%s\n' "${fileArgs[@]}"`
# You'll see pairs like these:
# file1.txt
# file1_unix.txt
# ...
fileArgs=(); i=0
for inFile in "${inFiles[@]}"; do
  fileArgs+=( "$inFile" "${outFiles[i++]}" )
done
# Now, use a *single* invocation of dos2unix, passing all input-output
# filename pairs at once.
dos2unix -q -n "${fileArgs[@]}"

Delete entry from /etc/fstab using sed

I am scripting a solution wherein, when a shared drive is removed from the server, we need to remove the entry from fstab as well.
What I have done till now :
MP="${ServerAddress}:${DirectoryPath} ${MountPoint} nfs defaults 0 0"
while read line; do
  if [ "${MP}" = "${line}" ]; then
    sed "/${line}/d"
  fi
done < /etc/fstab
This is giving an error >> sed: 1: "/servername.net...": command I expects \ followed by text
Please suggest how this entry can be deleted.
PS: I am running this as part of a script, so I don't have to run it individually. While running with the suggested options I am able to delete, but during the script this does not work and gives that error. To those commenting on the " character or the formatting: I cannot copy from there, since that is a remote session through a terminal server.
Try the following:
sed -i.bak "\#^$SERVERADDR:$DIRPATH#d" /etc/fstab
After setting meaningful values for SERVERADDR and DIRPATH. That line will also make a backup of the old file (named fstab.bak). But since /etc/fstab is such an important file, please make sure to have more backups handy.
Let me point out that you only need that single line, no while loop, no script!
Note that the shell is case-sensitive and expects " as double quotes and not ” (a problem now fixed in the question). You need to configure your editor not to capitalize words randomly for you (also now fixed). Note that experimenting on system configuration files is A Bad Idea™. Make a copy of the file and experiment on the copy. Only when you're sure that the script works on the copy do you risk the live file. And then you make a backup of the live file before making the changes. (And avoid doing the experimenting as root if at all possible; that limits the damage you can do.)
Your script as written might as well not use sed. Change the comparison to != and simply use echo "$line" to echo the lines you want to standard output.
MP="${ServerAddress}:${DirectoryPath} ${MountPoint} nfs defaults 0 0"
while read line; do
if [ "${MP}" != "${line}" ]; then
echo "$line"
fi
done < /etc/fstab
Alternatively, you can use sed in a single command. The only trick is that the pattern probably contains slashes, so the \# notation says 'use # as the search marker'.
MP="${ServerAddress}:${DirectoryPath} ${MountPoint} nfs defaults 0 0"
sed -e "\#$MP#d" /etc/fstab
If you have GNU sed, when you're satisfied with that, you can add the -i.bak option to overwrite the /etc/fstab file (preserving a copy of the original in /etc/fstab.bak). Or use -i.$(date +'%Y%m%d.%H%M%S') (the suffix must be attached to -i, with no space) to get a backup extension that is the current date/time. Or use version control on the file.
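For example, a sketch of that last variant (GNU sed assumed, reusing the $MP pattern from above):
MP="${ServerAddress}:${DirectoryPath} ${MountPoint} nfs defaults 0 0"
sed -i".$(date +'%Y%m%d.%H%M%S')" -e "\#$MP#d" /etc/fstab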

Best way to overwrite file with itself

What is the fastest / most elegant way to read out a file and then write the content to that same file?
On Linux, this is not always the same as 'touching' a file, e.g. if the file represents some hardware device.
One possibility that worked for me is echo $(cat $file) > $file but I wondered if this is best practice.
You cannot generically do this in one step because writing to the file can interfere with reading from it. If you try this on a regular file, it's likely the > redirection will blank the file out before anything is even read.
Your safest bet is to split it into two steps. You can hide it behind a function call if you want.
rewrite() {
  local file=$1
  local contents=$(< "$file")
  cat <<< "$contents" > "$file"
}
...
rewrite /sys/misc/whatever
I understand that you are not interested in a simple touch $file, but rather in some form of transformation of the file.
Since transforming the file can leave it temporarily corrupted, it is NOT good to do it in place while others may be reading it at the same moment.
Because of that, a temporary file is your best choice: write the converted output to it, and only then replace the original file, in one step:
cat "$file" | process >"$file.$$"
mv "$file.$$" "$file"
This is the safest set of operations.
Of course, I omitted operations like 'check if the file really exists' and so on, leaving them to you.
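A slightly fuller sketch with those checks added (process stands for whatever transformation you run, and the path is a placeholder; mktemp keeps the temporary file next to the original so the final mv stays on the same filesystem):
file=./somefile          # placeholder path
[ -f "$file" ] || { echo "no such file: $file" >&2; exit 1; }
tmp=$(mktemp "$file.XXXXXX") || exit 1
process <"$file" >"$tmp" && mv "$tmp" "$file"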
The sponge utility in the moreutils package allows this
~$ cat file
foo
~$ cat file | sponge file
~$ cat file
foo
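Because sponge soaks up all of its input before opening the output file, it is also safe to combine with an actual transformation, for example:
sed 's/foo/bar/g' file | sponge file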

Accessing each line using a $ sign in linux

Whenever I execute a Linux command that outputs multiple lines, I want to perform some operation on each line of the output. Generally I do:
command something | while read a
do
  some operation on $a;
done
This works fine. But my question is: is there some way I can access each line through a predefined symbol (I don't know what to call it), something like $? or $! or $_?
Is it possible to do
cat to_be_removed.txt | rm -f $LINE
Is there a predefined $LINE in bash, or is the previous approach the shortest way, i.e.
cat to_be_removed.txt | while read line; do rm -f $line; done;
xargs is what you're looking for:
cat to_be_removed.txt | xargs rm -f
Watch out for spaces in your filenames if you use that one, though. Check out the xargs man page for more information.
You might be looking for the xargs command.
It takes control arguments, plus a command and optionally some arguments for the command. It then reads its standard input, normally splitting at white space, and then arranges to repeatedly execute the command with the given arguments and as many 'file names' read from the standard input as will fit on the command line.
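Since xargs splits its input at whitespace by default, filenames containing spaces need extra care; one hedged sketch is to feed it NUL-delimited names instead (xargs -0 is widely supported):
tr '\n' '\0' < to_be_removed.txt | xargs -0 rm -f --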
rm -f $(<to_be_removed.txt)
This works because rm can take multiple files as arguments. It also makes it much more efficient, because you only call rm once and you don't need to create a pipe to cat or xargs.
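Note that the unquoted $(<to_be_removed.txt) undergoes word splitting, so a filename containing spaces would be split apart; a sketch that keeps each line intact instead (bash 4+):
mapfile -t files < to_be_removed.txt
rm -f -- "${files[@]}"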
On a separate note, rather than using pipes in a while loop, you can avoid a subshell by using process substitution:
while read -r line; do
  some operation on "$line";
done < <(command something)
The additional benefit you get by avoiding a subshell is that variables you change inside the loop maintain their altered values outside the loop as well. This is not the case when using the pipe form and it is a common gotcha.
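A small sketch of that difference (command something is the placeholder from the question): a counter incremented in the process-substitution form keeps its value, while in the piped form the loop runs in a subshell and the counter stays 0.
count=0
while read -r line; do
  (( count++ ))
done < <(command something)
echo "$count"    # number of lines read

count=0
command something | while read -r line; do (( count++ )); done
echo "$count"    # still 0: the loop ran in a subshell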

Linux command to replace string in LARGE file with another string

I have a huge SQL file that gets executed on the server. The dump is from my machine, and in it there are a few settings relating to my machine. So basically, I want every occurrence of "c://temp" to be replaced by "//home//some//blah".
How can this be done from the command line?
sed is a good choice for large files.
sed -i.bak -e 's%C://temp%//home//some//blah%' large_file.sql
It is a good choice because it doesn't read the whole file at once in order to change it. Quoting the manual:
A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ed), sed works by making only one pass over the input(s), and is consequently more efficient. But it is sed's ability to filter text in a pipeline which particularly distinguishes it from other types of editors.
The relevant manual section is here. A small explanation follows
-i.bak enables in-place editing, leaving a backup copy with a .bak extension
s%foo%bar% uses s, the substitution command, which substitutes matches of the first string between the % signs ('foo') with the second string ('bar'). It's usually written as s/old/new/, but because your strings contain plenty of slashes, it's more convenient to switch the delimiter to something else so you avoid having to escape them.
Example
vinko@mithril:~$ sed -i.bak -e 's%C://temp%//home//some//blah%' a.txt
vinko@mithril:~$ more a.txt
//home//some//blah
D://temp
//home//some//blah
D://temp
vinko@mithril:~$ more a.txt.bak
C://temp
D://temp
C://temp
D://temp
Just for completeness. In place replacement using perl.
perl -i -p -e 's{c://temp}{//home//some//blah}g' mysql.dmp
No backslash escapes required either. ;)
Try sed? Something like:
sed 's/c:\/\/temp/\/\/home\/\/some\/\/blah/' mydump.sql > fixeddump.sql
Escaping all those slashes makes this look horrible, though; here's a simpler example, which changes foo to bar.
sed 's/foo/bar/' mydump.sql > fixeddump.sql
As others have noted, you can choose your own delimiter, which would prevent the leaning toothpick syndrome in this case:
sed 's|c://temp|//home//some//blah|' mydump.sql > fixeddump.sql
The clever thing about sed is that it operates on a stream rather than on the whole file at once, so you can process huge files using only a modest amount of memory.
There's also a non-standard UNIX utility, rpl, which does the exact same thing that the sed examples do; however, I'm not sure whether rpl operates streamwise, so sed may be the better option here.
The sed command can do that.
Rather than escaping the slashes, you can choose a different delimiter (_ in this case):
sed -e 's_c://temp/_/home//some//blah/_' file1.txt > file2.txt
perl -pi -e 's#c://temp#//home//some//blah#g' yourfilename
The -p flag treats the script as a loop: it reads the specified file line by line, running the regex search and replace on each line.
-i should be used in conjunction with -p; it tells Perl to edit the file in place.
-e just means execute this perl code.
Good luck
gawk
awk '{gsub("c://temp","//home//some//blah")}1' file
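awk does not edit in place, so redirect the output to a new file and move it back over the original; a hedged sketch (GNU awk 4.1+ also offers -i inplace, but plain redirection is portable):
awk '{gsub("c://temp","//home//some//blah")}1' file > file.new && mv file.new file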
