Improve performance of Bash loop that removes Windows line endings - Linux

Editor's note: This question was always about loop performance, but the original title led some answerers - and voters - to believe it was about how to remove Windows line endings.
The bash loop below just removes the Windows line endings, converting them to Unix, and appears to be running, but it is slow. The input files are small (4 files ranging from 167 bytes to 1 KB), and they all have the same structure (a list of names); the only thing that varies is the length (i.e. some files have 10 names, others 50). Is it supposed to take over 15 minutes to complete this task on a Xeon processor? Thank you :)
for f in /home/cmccabe/Desktop/files/*.txt ; do
bname=`basename $f`
pref=${bname%%.txt}
sed 's/\r//' $f - $f > /home/cmccabe/Desktop/files/${pref}_unix.txt
done
Input .txt files
AP3B1
BRCA2
BRIP1
CBL
CTC1
EDIT
This is not a duplicate, as I was asking why my bash loop that uses sed to remove Windows line endings was running so slowly. I did not mean to ask how to remove them; I was asking for ideas that might speed up the loop, and I got many. Thank you :). I hope this helps.

Use the utilities dos2unix and unix2dos to convert between unix and windows style line endings.
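For example (a minimal sketch; both tools convert the named file in place by default):
dos2unix file.txt    # CRLF -> LF
unix2dos file.txt    # LF -> CRLF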

Your 'sed' command looks wrong. I believe the trailing $f - $f should simply be $f. Running your script as written hangs for a very long time on my system (the - argument makes sed wait for input on standard input), but making this change causes it to complete almost instantly.
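For reference, a sketch of the question's loop with just that fix applied (quoting tightened, and the stray - $f removed):
for f in /home/cmccabe/Desktop/files/*.txt ; do
    bname=$(basename "$f")
    pref=${bname%%.txt}
    sed 's/\r//' "$f" > /home/cmccabe/Desktop/files/"${pref}_unix.txt"
done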
Of course, the best answer is to use dos2unix, which was designed to handle this exact thing:
cd /home/cmccabe/Desktop/files
for f in *.txt ; do
pref=$(basename -s '.txt' "$f")
dos2unix -q -n "$f" "${pref}_unix.txt"
done

This always works for me:
perl -pe 's/\r\n/\n/' inputfile.txt > outputfile.txt

You can use dos2unix as stated before, or this small sed command:
sed 's/\r//' file

The key to performance in Bash is to avoid loops in general, and in particular those that call one or more external utilities in each iteration.
Here is a solution that uses a single GNU awk command:
awk -v RS='\r\n' '
BEGINFILE { outFile=gensub("\\.txt$", "_unix&", 1, FILENAME) }
{ print > outFile }
' /home/cmccabe/Desktop/files/*.txt
-v RS='\r\n' sets CRLF as the input record separator; since ORS, the output record separator, is left at its default of \n, simply printing each input line terminates it with \n.
The BEGINFILE block is executed every time processing of a new input file starts; in it, gensub() inserts _unix before the .txt suffix of the input file at hand to form the output filename.
{print > outFile} simply prints the \n-terminated lines to the output file at hand.
Note that the use of a multi-character RS value, the BEGINFILE block, and the gensub() function are GNU extensions to the POSIX standard.
Switching from the OP's sed solution to a GNU awk-based one was necessary in order to provide a single-command solution that is both simpler and faster.
Alternatively, here's a solution that relies on dos2unix for conversion of Windows line endings (for instance, you can install dos2unix with sudo apt-get install dos2unix on Debian-based systems); except for requiring dos2unix, it should work on most platforms (no GNU utilities required):
It uses a loop only to construct the array of filename arguments to pass to dos2unix - this should be fast, given that no call to basename is involved; Bash-native parameter expansion is used instead.
It then uses a single invocation of dos2unix to process all files.
# cd to the target folder, so that the operations below do not need to handle
# path components.
cd '/home/cmccabe/Desktop/files'
# Collect all *.txt filenames in an array.
inFiles=( *.txt )
# Derive output filenames from it, using Bash parameter expansion:
# '%.txt' matches '.txt' at the end of each array element, and replaces it
# with '_unix.txt', effectively inserting '_unix' before the suffix.
outFiles=( "${inFiles[#]/%.txt/_unix.txt}" )
# Create an interleaved array of *input-output filename pairs* to be passed
# to dos2unix later.
# To inspect the resulting array, run `printf '%s\n' "${fileArgs[@]}"`
# You'll see pairs like these:
# file1.txt
# file1_unix.txt
# ...
fileArgs=(); i=0
for inFile in "${inFiles[@]}"; do
fileArgs+=( "$inFile" "${outFiles[i++]}" )
done
# Now, use a *single* invocation of dos2unix, passing all input-output
# filename pairs at once.
dos2unix -q -n "${fileArgs[@]}"

Related

Wildcard in sed command to replace string not working

I'm trying to use the sed command in terminal to replace a specific line in all my text files with a certain extension by a specific string:
sed -i.bak '35s/^.*$/5\) 1\-4/' fitting_file*.feedme
So I am trying to replace line 35 in each of these files with the string "5) 1-4". When I run an ls fitting_file*.feedme | wc -l command in this directory, I get 221 files. However, when I run the above sed command, it only edits the FIRST file in the order of ls fitting_file*.feedme. I know this because grep '5) 1-4' fitting_file*.feedme continually only returns the first file on the list after I run the replacement command. I also tried replacing fitting_file*.feedme with a space-separated list of a couple of these files in my sed command as a test, but it still only operated on the one I chose to list first. Why is this happening?
sed operates on a single stream: it essentially concatenates all the files together and treats that as one stream, so it replaces the 35th line of the big concatenated stream.
To see this, make a 20 line file called A and a 20 line file called B. Apply your sed command as
sed -i.bak '35s/^.*$/5\) 1\-4/' A B
and you will see the 15th line of B replaced.
I think this should answer your direct question. As for how to get done what you'd like, I assume you've already figured out that wrapping your sed command in a for loop is one way to do it. :)
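A sketch of that loop-based approach (same sed command as in the question, run once per file):
for f in fitting_file*.feedme; do
    sed -i.bak '35s/^.*$/5\) 1\-4/' "$f"
done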
Try
Create a file containing your sed instruction like this
#!/bin/bash
sed -i.bak '35s/^.*$/5\) 1\-4/' "$1"
exit 0
and call it prog.sh. Next make it executable :
chmod u+x prog.sh
now you can solve your problem using
find . -name fitting_file\*.feedme -exec ./prog.sh {} \;
You could do all this on one line but frankly the number of escapes required is a bit much. Good luck.
To do what you're trying to do without using a shell loop:
awk -i inplace -v inplace::suffix=.bak 'FNR==35{$0="5) 1-4"}1' fitting_file*.feedme
Note that unlike sed which can just count lines across all input files, awk has NR to track the number of records (lines by default) across all files and FNR for the same but just within the current file.
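To see the difference, with the two 20-line files A and B from the earlier example, something like this prints the running count (NR) next to the per-file count (FNR):
awk '{ print FILENAME, NR, FNR }' A B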
The above uses GNU awk for inplace editing, just like GNU sed has -i for that. The default awk on macOS is BSD awk, not GNU awk, but you should install GNU awk as it doesn't have all the bugs/quirks that BSD awk does and it has a ton of extremely useful extensions.
If you just want to use macOS's awk then it'd be something like:
find . -name 'fitting_file*.feedme' -exec sh -c "\
awk 'FNR==35{\$0=\"5) 1-4\"}1' \"\$1\" > \"\$1.bak\" &&
mv -- \"\$1.bak\" \"\$1\"
" sh {} \;
which is obviously getting kinda complicated - I'd probably put the awk+mv script in a file to execute from sh -c or just resort to a shell loop myself if faced with that alternative (or a similar quoting nightmare with xargs)!
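A hedged sketch of that script-in-a-file alternative (hypothetical filename fix35.sh, run via find rather than a shell loop):
#!/bin/sh
# fix35.sh (hypothetical): rewrite line 35 of the file given as $1.
# The .bak file is only used as a temporary here; it is moved back over the original.
awk 'FNR==35{$0="5) 1-4"}1' "$1" > "$1.bak" && mv -- "$1.bak" "$1"
After chmod +x fix35.sh, it can be run over all the files with:
find . -name 'fitting_file*.feedme' -exec ./fix35.sh {} \;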

How can I remove the last character of a file in unix?

Say I have some arbitrary multi-line text file:
sometext
moretext
lastline
How can I remove only the last character (the e, not the newline or null) of the file without making the text file invalid?
A simpler approach (outputs to stdout, doesn't update the input file):
sed '$ s/.$//' somefile
$ is a Sed address that matches the last input line only, thus causing the following function call (s/.$//) to be executed on the last line only.
s/.$// replaces the last character on the (in this case last) line with an empty string; i.e., effectively removes the last char. (before the newline) on the line.
. matches any character on the line, and following it with $ anchors the match to the end of the line; note how the use of $ in this regular expression is conceptually related, but technically distinct from the previous use of $ as a Sed address.
Example with stdin input (assumes Bash, Ksh, or Zsh):
$ sed '$ s/.$//' <<< $'line one\nline two'
line one
line tw
To update the input file too (do not use if the input file is a symlink):
sed -i '$ s/.$//' somefile
Note:
On macOS, you'd have to use -i '' instead of just -i; for an overview of the pitfalls associated with -i, see the bottom half of this answer.
If you need to process very large input files and/or performance / disk usage are a concern and you're using GNU utilities (Linux), see ImHere's helpful answer.
truncate
truncate -s-1 file
Removes one (-1) character from the end of the same file, just as >> appends to the same file.
The problem with this approach is that it doesn't retain a trailing newline if it existed.
The solution is:
if [ -n "$(tail -c1 file)" ] # if the file has not a trailing new line.
then
truncate -s-1 file # remove one char as the question request.
else
truncate -s-2 file # remove the last two characters
echo "" >> file # add the trailing new line back
fi
This works because tail reads the last byte (not the last character).
It takes almost no time even with big files.
Why not sed
The problem with a sed solution like sed '$ s/.$//' file is that it reads the whole file first (taking a long time with large files), then you need a temporary file (of the same size as the original):
sed '$ s/.$//' file > tempfile
rm file; mv tempfile file
And then move the tempfile to replace the file.
Here's another using ex, which I find not as cryptic as the sed solution:
printf '%s\n' '$' 's/.$//' wq | ex somefile
The $ goes to the last line, the s deletes the last character, and wq is the well known (to vi users) write+quit.
After a whole bunch of playing around with different strategies (and avoiding sed -i or perl), the best way I found to do this was with:
sed '$! { P; D; }; s/.$//' somefile
If the goal is to remove the last character in the last line, this awk should do:
awk '{a[NR]=$0} END {for (i=1;i<NR;i++) print a[i];sub(/.$/,"",a[NR]);print a[NR]}' file
sometext
moretext
lastlin
It stores all the data in an array, then prints it out, changing the last line.
Just a remark: sed -i will temporarily remove the file (it writes the result to a new file and renames it over the original).
So if you are tailing the file, you'll get a "No such file or directory" warning until you reissue the tail command.
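A possible workaround (a sketch, assuming GNU tail): tail -F follows the file by name and reopens it after it has been replaced, so it keeps working across sed -i runs:
tail -F somefile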
EDITED ANSWER
I created a script and put your text inside on my Desktop. this test file is saved as "old_file.txt"
sometext
moretext
lastline
Afterwards I wrote a small script to take the old file and eliminate the last character in the last line
#!/bin/bash
no_of_new_line_characters=`wc -l < '/root/Desktop/old_file.txt'`
let "no_of_lines=no_of_new_line_characters+1"
sed -n 1,"$no_of_new_line_characters"p '/root/Desktop/old_file.txt' > '/root/Desktop/my_new_file'
sed -n "$no_of_lines","$no_of_lines"p '/root/Desktop/old_file.txt'|sed 's/.$//g' >> '/root/Desktop/my_new_file'
Opening the new file I created shows the output as follows:
sometext
moretext
lastlin
I apologize for my previous answer (wasn't reading carefully)
sed 's/.$//' filename | tee newFilename
This should do your job.
A couple perl solutions, for comparison/reference:
(echo 1a; echo 2b) | perl -e '$_=join("",<>); s/.$//; print'
(echo 1a; echo 2b) | perl -e 'while(<>){ if(eof) {s/.$//}; print }'
I find the first read-whole-file-into-memory approach can be generally quite useful (less so for this particular problem). You can now write regexes which span multiple lines, for example to combine every 3 lines of a certain format into 1 summary line.
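As a quick illustration of that slurp-then-regex idea (made-up input; it joins every 3 lines into one):
printf '%s\n' a b c d e f | perl -e '$_=join("",<>); s/(.*)\n(.*)\n(.*)\n/$1 $2 $3\n/g; print'
which prints "a b c" and "d e f" on two lines.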
For this problem, truncate would be faster and the sed version is shorter to type. Note that truncate requires a file to operate on, not a stream. Normally I find sed to lack the power of perl and I much prefer the extended-regex / perl-regex syntax. But this problem has a nice sed solution.

How to make many edits to files, without writing to the harddrive very often, in BASH?

I often need to make many edits to text files. The files are typically 20 MB in size and require ~500,000 individual edits, all which must be made in a very specific order. Here is a simple example of a script I might need to use:
while read -r line
do
...
(20-100 lines of BASH commands preparing $a and $b)
...
sed -i "s/$a/$b/g" ./editfile.txt
...
done < ./readfile.txt
As many other lines of code appear before and after the sed script, it seems the only option for editing the file is sed with the -i option. Many have warned me against using sed -i, as that makes too many writes to the file. Recently, I had to replace two computers, as the hard drives stopped working after running the scripts. I need to find a solution that does not damage my computer's hardware.
Is there some way to send the files somewhere else, such as storing the whole file in a BASH variable or in RAM, where sed, grep, and awk can make the edits without making millions of writes to the hard drive?
Don't use sed -i once per transform. A far better approach -- leaving you with more control -- is to construct a pipeline (if you can't use a single sed with multiple -e arguments to perform multiple operations within a single instance), and redirect to or from disk at only the beginning and end.
This can even be done recursively, if you use a FD other than stdin for reading from your file:
editstep() {
read -u 3 -r line       # read a line from readfile (on FD 3) into $line
if [[ $line ]]; then    # we read something new from readfile
sed ... | editstep # perform the edits, then a recursive call!
else
cat
fi
}
editstep <editfile.txt >editfile.txt.new 3<readfile.txt
Better than that, though, is to consolidate to a single sed instance.
sed_args=( )
while read -r line; do
sed_args+=( -e "s/in/out/" )
done <readfile.txt
sed -i "${sed_args[#]}" editfile.txt
...or, for edit lists too long to pass in on the command line:
sed_args=( )
while read -r line; do
sed_args+=( "s/in/out/" )
done <readfile.txt
sed -i -f <(printf '%s\n' "${sed_args[@]}") editfile.txt
(Please don't read the above as an endorsement of sed -i, which is a non-POSIX extension and has its own set of problems; the POSIX-specified editor intended for in-place rather than streaming operations is ex, not sed).
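For reference, a minimal ex sketch for a single in-place substitution over the whole file (hypothetical in/out patterns; it assumes the pattern actually occurs in the file, otherwise ex reports an error):
printf '%s\n' '%s/in/out/g' 'x' | ex editfile.txt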
Even better? Don't use sed at all, but keep all the operations inline in native bash.
Consider the following:
content=$(<editfile.txt)
while IFS= read -r; do
# put your own logic here to set `in` and `out`
content=${content//$in/$out}
done <readfile.txt
printf '%s\n' "$content" >editfile.new
One important caveat: This approach treats in as a literal string, not a regular expression. Depending on the edits you're actually making, this may actually improve correctness over the original code... but in any event, it's worth being aware of.
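A tiny illustration of that caveat (made-up values):
in='a.c'; out='X'; s='abc a.c'
echo "${s//$in/$out}"            # abc X  - bash matches 'a.c' literally
echo "$s" | sed "s/$in/$out/g"   # X X    - sed treats '.' as "any character"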
Another caveat: Reading the file's contents into a bash string is not necessarily a lossless operation; expect content to be truncated at the first NUL byte (if any exist), and a trailing newline to be added at the end of the file if none existed before.
Simple...
Instead of trying so many different approaches, you can simply copy all your files and dirs to /dev/shm.
This is a RAM-backed filesystem (tmpfs). When you are done editing, copy everything back to the original destination. Do not forget to run sync after you are done :-)
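A minimal sketch of that approach, reusing the loop from the question (the $a/$b preparation is elided, as in the original):
cp ./editfile.txt /dev/shm/editfile.txt
while read -r line
do
    ...                          # prepare $a and $b as before
    sed -i "s/$a/$b/g" /dev/shm/editfile.txt
done < ./readfile.txt
cp /dev/shm/editfile.txt ./editfile.txt
sync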

How do I insert the results of several commands on a file as part of my sed stream?

I use DJing software on linux (xwax) which uses a 'scanning' script (visible here) that compiles all the music files available to the software and outputs a string which contains a path to the filename and then the title of the mp3. For example, if it scans path-to-mp3/Artist - Test.mp3, it will spit out a string like so:
path-to-mp3/Artist - Test.mp3[tab]Artist - Test
I have tagged all my mp3s with BPM information via the id3v2 tool and have a commandline method for extracting that information as follows:
id3v2 -l name-of-mp3.mp3 | grep TBPM | cut -d: -f2
That spits out JUST the numerical BPM to me. What I'd like to do is prepend the BPM number from the above command as part of the xwax scanning script, but I'm not sure how to insert that command in the midst of the script. What I'd want it to generate is:
path-to-mp3/Artist - Test.mp3[tab][bpm]Artist - Test
Any ideas?
It's not clear to me where in that script you want to insert the BPM number, but the idea is this:
To embed the output of one command into the arguments of another, you can use the "command substitution" notation `...` or $(...). For example, this:
rm $(echo abcd)
runs the command echo abcd and substitutes its output (abcd) into the overall command; so that's equivalent to just rm abcd. It will remove the file named abcd.
The above doesn't work inside single-quotes. If you want, you can just put it outside quotes, as I did in the above example; but it's generally safer to put it inside double-quotes (so as to prevent some unwanted postprocessing). Either of these:
rm "$(echo abcd)"
rm "a$(echo bc)d"
will remove the file named abcd.
In your case, you need to embed the command substitution into the middle of an argument that's mostly single-quoted. You can do that by simply putting the single-quoted strings and double-quoted strings right next to each other with no space in between, so that Bash will combine them into a single argument. (This also works with unquoted strings.) For example, either of these:
rm a"$(echo bc)"d
rm 'a'"$(echo bc)"'d'
will remove the file named abcd.
Edited to add: O.K., I think I understand what you're trying to do. You have a command that either (1) outputs out all the files in a specified directory (and any subdirectories and so on), one per line, or (2) outputs the contents of a file, where the contents of that file is a list of files, one per line. So in either case, it's outputting a list of files, one per line. And you're piping that list into this command:
sed -n '
{
# /[<num>[.]] <artist> - <title>.ext
s:/\([0-9]\+.\? \+\)\?\([^/]*\) \+- \+\([^/]*\)\.[A-Z0-9]*$:\0\t\2\t\3:pi
t
# /<artist> - <album>[/(Disc|Side) <name>]/[<ABnum>[.]] <title>.ext
s:/\([^/]*\) \+- \+\([^/]*\)\(/\(disc\|side\) [0-9A-Z][^/]*\)\?/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t\1\t\6:pi
t
# /[<ABnum>[.]] <name>.ext
s:/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t\t\2:pi
}
'
which runs a sed script over that list. What you want is for all of the replacement-strings to change from \0\t... to \0\tBPM\t..., where BPM is the BPM number computed from your command. Right? And you need to compute that BPM number separately for each file, so instead of relying on sed's implicit line-by-line looping, you need to handle the looping yourself, and process one line at a time. Right?
So, you should change the above command to this:
while read -r LINE ; do # loop over the lines, saving each one as "$LINE"
BPM=$(id3v2 -l "$LINE" | grep TBPM | cut -d: -f2) # save BPM as "$BPM"
sed -n '
{
# /[<num>[.]] <artist> - <title>.ext
s:/\([0-9]\+.\? \+\)\?\([^/]*\) \+- \+\([^/]*\)\.[A-Z0-9]*$:\0\t'"$BPM"'\t\2\t\3:pi
t
# /<artist> - <album>[/(Disc|Side) <name>]/[<ABnum>[.]] <title>.ext
s:/\([^/]*\) \+- \+\([^/]*\)\(/\(disc\|side\) [0-9A-Z][^/]*\)\?/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t'"$BPM"'\t\1\t\6:pi
t
# /[<ABnum>[.]] <name>.ext
s:/\([A-H]\?[A0-9]\?[0-9].\? \+\)\?\([^/]*\)\.[A-Z0-9]*$:\0\t'"$BPM"'\t\t\2:pi
}
' <<<"$LINE" # take $LINE as input, rather than reading more lines
done
(where the only change to the sed script itself was to insert '"$BPM"'\t in a few places to switch from single-quoting to double-quoting, then insert the BPM, then switch back to single-quoting and add a tab).

Extracting sub-strings in Unix

I'm using cygwin on Windows 7. I want to loop through a folder consisting of about 10,000 files and perform a signal processing tool's operation on each file. The problem is that the file names have some excess characters that are not compatible with the operation. Hence, I need to extract just a certain part of the file names.
For example if the file name is abc123456_justlike.txt.rna I need to use abc123456_justlike.txt. How should I write a loop to go through each file and perform the operation on the shortened file names?
I tried the cut -b1-10 command, but that doesn't let my tool perform the necessary operation. I'd appreciate help with this problem.
Try some shell scripting, using the ${NAME%TAIL} parameter substitution: the contents of variable NAME are expanded, but any suffix material which matches the TAIL glob pattern is chopped off.
$ NAME=abc12345.txt.rna
$ echo ${NAME%.rna}    # prints: abc12345.txt
# process all files in the directory, taking off their .rna suffix
$ for x in *; do signal_processing_tool "${x%.rna}" ; done
If there are variations among the file names, you can classify them with a case:
for x in * ; do
case $x in
*.rna )
# do something with .rna files
;;
*.txt )
# do something else with .txt files
;;
* )
# default catch-all-else case
;;
esac
done
Try sed:
echo a.b.c | sed 's/\.[^.]*$//'
The s command in sed performs a search-and-replace operation; in this case it replaces the regular expression \.[^.]*$ (meaning: a dot, followed by any number of non-dots, at the end of the string) with the empty string.
If you are not yet familiar with regular expressions, this is a good point to learn them. I find manipulating strings using regular expressions much more straightforward than using tools like cut (or their equivalents).
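A sketch combining that sed with a loop over the files from the question (signal_processing_tool stands in for the OP's actual tool, as in the earlier answer):
for x in *.rna; do
    signal_processing_tool "$(printf '%s\n' "$x" | sed 's/\.[^.]*$//')"
done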
If you are trying to extract the list of filenames from a directory use the below command.
ls -ltr | awk -F " " '{print $9}' | cut -c1-10
