How can I remove the BOM from a UTF-8 file? [duplicate] - linux

I have a file in UTF-8 encoding with BOM and want to remove the BOM. Are there any linux command-line tools to remove the BOM from the file?
$ file test.xml
test.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines

Using VIM
Open file in VIM:
vi test.xml
Remove BOM encoding:
:set nobomb
Save and quit:
:wq
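If you want to check whether Vim detected a BOM at all, you can query the option before changing it:
:set bomb?
This prints bomb if the file was read with a BOM and nobomb otherwise.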
For a non-interactive solution, try the following command line:
vi -c ":set nobomb" -c ":wq" test.xml
That should remove the BOM, save the file and quit, all from the command line.

A BOM is the Unicode codepoint U+FEFF; its UTF-8 encoding consists of the three bytes 0xEF, 0xBB, 0xBF.
With bash, you can create a UTF-8 BOM with the $'' special quoting form, which implements Unicode escapes: $'\uFEFF'. So with bash, a reliable way of removing a UTF-8 BOM from the beginning of a text file would be:
sed -i $'1s/^\uFEFF//' file.txt
This will leave the file unchanged if it does not start with a UTF-8 BOM, and otherwise remove the BOM.
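If you want to convince yourself that $'\uFEFF' really expands to those three bytes, a quick check (assuming bash 4.2 or later and a UTF-8 locale) is:
printf '\uFEFF' | od -An -tx1   # should print: ef bb bf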
If you are using some other shell, you might find that "$(printf '\ufeff')" produces the BOM character (that works with zsh, as well as with any shell without a printf builtin provided that /usr/bin/printf is the GNU version), but if you want a POSIX-compatible version you could use:
sed "$(printf '1s/^\357\273\277//')" file.txt
(The -i in-place edit flag is also a GNU extension; this version writes the possibly-modified file to stdout.)

Well, I just dealt with this today and my preferred way was dos2unix.
dos2unix will remove the BOM and also take care of other idiosyncrasies from other operating systems:
$ sudo apt install dos2unix
$ dos2unix test.xml
It's also possible to remove the BOM only (-r, --remove-bom):
$ dos2unix -r test.xml
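If you would rather keep the original file untouched, dos2unix can write the converted output to a new file with -n (test_nobom.xml below is just an example output name):
$ dos2unix -r -n test.xml test_nobom.xml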
Note: tested with dos2unix 7.3.4

If you are certain that a given file starts with a BOM, then it is possible to remove it with the tail command:
tail --bytes=+4 withBOM.txt > withoutBOM.txt
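If you are not certain, a small guard that only strips when the first three bytes really are the UTF-8 BOM looks like this (a sketch using the same tail call plus standard head/od; adjust the filenames as needed):
if [ "$(head -c 3 withBOM.txt | od -An -tx1 | tr -d ' ')" = "efbbbf" ]; then
  tail --bytes=+4 withBOM.txt > withoutBOM.txt
else
  cp withBOM.txt withoutBOM.txt
fi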

Joshua Pinter's answer works correctly on macOS, so I wrote a script that removes the BOM from all files in a given folder; see here.
It can be used as follows:
Remove BOM from all files in current directory: rmbom .
Print all files with a BOM in the current directory: rmbom . -a
Only remove BOM from all files in current directory with extension txt or cs: rmbom . -e txt -e cs

If you want to work on a bulk of files, building on Reginaldo Santos's answer, there is a quick way:
find . -name "*.java" | grep java$ | xargs -n 1 dos2unix
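A slightly more robust variant (a sketch using only standard find/xargs features) drops the redundant grep and copes with filenames containing spaces:
find . -name "*.java" -print0 | xargs -0 dos2unix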

Related

Linux: iconv ASCII text to UTF-8

I have a file which contains the letter ö. Except that it doesn't. When I open the file in gedit, I see:
\u00f6
I tried to convert the file, applying code that I found on other threads:
$ file blb.txt
blb.txt: ASCII text
$ iconv -f ISO-8859-15 -t UTF-8 blb.txt > blb_tmp.txt
$ file blb_tmp.txt
blb_tmp.txt: ASCII text
What am I missing?
EDIT
I found this solution:
echo -e "$(cat blb.txt)" > blb_tmp.txt
$ file blb_tmp.txt
blb_tmp.txt: UTF-8 Unicode text
The -e "enables interpretation of backslash escapes".
Still not sure why iconv didn't make it happen. I'm guessing it's something like "iconv only changes the encoding, it doesn't interpret". Not sure yet what the difference is, though. Why did the Unicode people make this world such a mess? :D
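For what it's worth, that guess is right: the file contains the six literal ASCII characters \u00f6 rather than the character ö, so there is nothing for iconv to transcode. A quick way to see the difference (assuming bash 4.2+ or GNU printf, in a UTF-8 locale):
printf '%s' '\u00f6' | od -An -c     # the literal escape: six separate ASCII characters
printf '\u00f6' | od -An -tx1        # the real character: the two UTF-8 bytes c3 b6
echo -e interprets the escape and writes the real character, which is why the converted copy is then detected as UTF-8.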

How to replace a string containing "\u2015"?

Does anyone know how to replace a string containing \u2015 in a SED command like the example below?
sed -ie "s/some text \u2015 some more text/new text/" inputFileName
You just need to escape the backslash. The example below works fine in GNU sed version 4.2.1:
$ echo "some text \u2015 some more text" | sed -e "s/some text \\\u2015 some more text/abc/"
abc
Also, you don't have to use the -i flag, which according to the man page is only for editing files in place:
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if extension supplied). The default operation mode is to break symbolic and hard links. This can be changed with --follow-symlinks and
--copy.
Not sure if this is exactly what you need, but maybe you should take a look at native2ascii tool to convert such unicode escapes.
Normally it replaces all characters that cannot be displayed in ISO-8859-1 with their Unicode escapes (\uXXXX), but it also supports reverse conversion. Assuming you have some file in UTF-8 named "input" containing \u00abSome \u2015 string\u00bb, then executing
native2ascii -encoding UTF-8 -reverse input output
will result in an "output" file containing «Some ― string».
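If the file actually contains the horizontal-bar character U+2015 itself rather than the literal \u2015 escape, the bash $'...' quoting shown in the BOM answer above can put the real character into the sed script (a sketch, assuming bash and GNU sed for -i):
sed -i $'s/some text \u2015 some more text/new text/' inputFileName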

Improve performance of Bash loop that removes windows line endings

Editor's note: This question was always about loop performance, but the original title led some answerers - and voters - to believe it was about how to remove Windows line endings.
The bash loop below removes the Windows line endings and converts them to Unix, and it appears to be running, but it is slow. The input files are small (4 files ranging from 167 bytes to 1 kB), all with the same structure (a list of names); the only thing that varies is the length (i.e. some files are 10 names, others are 50). Is it supposed to take over 15 minutes to complete this task on a Xeon processor? Thank you :)
for f in /home/cmccabe/Desktop/files/*.txt ; do
bname=`basename $f`
pref=${bname%%.txt}
sed 's/\r//' $f - $f > /home/cmccabe/Desktop/files/${pref}_unix.txt
done
Input .txt files
AP3B1
BRCA2
BRIP1
CBL
CTC1
EDIT
This is not a duplicate, as I was asking why my bash loop that uses sed to remove Windows line endings was running so slowly. I did not mean to ask how to remove them; I was asking for ideas that might speed up the loop, and I got many. Thank you :). I hope this helps.
Use the utilities dos2unix and unix2dos to convert between Unix- and Windows-style line endings.
Your 'sed' command looks wrong. I believe the trailing $f - $f should simply be $f. Running your script as written hangs for a very long time on my system, but making this change causes it to complete almost instantly.
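For reference, here is the original loop with just that change applied (plus quoting of the variables, which is not strictly required here but is good practice):
for f in /home/cmccabe/Desktop/files/*.txt ; do
  bname=$(basename "$f")
  pref=${bname%%.txt}
  sed 's/\r//' "$f" > /home/cmccabe/Desktop/files/"${pref}_unix.txt"
done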
Of course, the best answer is to use dos2unix, which was designed to handle this exact thing:
cd /home/cmccabe/Desktop/files
for f in *.txt ; do
pref=$(basename -s '.txt' "$f")
dos2unix -q -n "$f" "${pref}_unix.txt"
done
This always works for me:
perl -pe 's/\r\n/\n/' inputfile.txt > outputfile.txt
You can use dos2unix as stated before, or use this small sed command:
sed 's/\r//' file
The key to performance in Bash is to avoid loops in general, and in particular those that call one or more external utilities in each iteration.
Here is a solution that uses a single GNU awk command:
awk -v RS='\r\n' '
BEGINFILE { outFile=gensub("\\.txt$", "_unix&", 1, FILENAME) }
{ print > outFile }
' /home/cmccabe/Desktop/files/*.txt
-v RS='\r\n' sets CRLF as the input record separator; since ORS, the output record separator, is left at its default of \n, simply printing each input line terminates it with \n.
the BEGINFILE block is executed every time processing of a new input file starts; in it, gensub() is used to insert _unix before the .txt suffix of the input file at hand to form the output filename.
{print > outFile} simply prints the \n-terminated lines to the output file at hand.
Note that use of a multi-char. RS value, the BEGINFILE block, and the gensub() function are GNU extensions to the POSIX standard.
Switching from the OP's sed solution to a GNU awk-based one was necessary in order to provide a single-command solution that is both simpler and faster.
Alternatively, here's a solution that relies on dos2unix for conversion of Windows line endings (for instance, you can install dos2unix with sudo apt-get install dos2unix on Debian-based systems); except for requiring dos2unix, it should work on most platforms (no GNU utilities required):
It uses a loop only to construct the array of filename arguments to pass to dos2unix - this should be fast, given that no call to basename is involved; Bash-native parameter expansion is used instead.
then uses a single invocation of dos2unix to process all files.
# cd to the target folder, so that the operations below do not need to handle
# path components.
cd '/home/cmccabe/Desktop/files'
# Collect all *.txt filenames in an array.
inFiles=( *.txt )
# Derive output filenames from it, using Bash parameter expansion:
# '%.txt' matches '.txt' at the end of each array element, and replaces it
# with '_unix.txt', effectively inserting '_unix' before the suffix.
outFiles=( "${inFiles[#]/%.txt/_unix.txt}" )
# Create an interleaved array of *input-output filename pairs* to be passed
# to dos2unix later.
# To inspect the resulting array, run `printf '%s\n' "${fileArgs[@]}"`
# You'll see pairs like these:
# file1.txt
# file1_unix.txt
# ...
fileArgs=(); i=0
for inFile in "${inFiles[#]}"; do
fileArgs+=( "$inFile" "${outFiles[i++]}" )
done
# Now, use a *single* invocation of dos2unix, passing all input-output
# filename pairs at once.
dos2unix -q -n "${fileArgs[#]}"

How do I convert text files in bash? (foo.txt to bar.txt)

I'm trying to append, via bash, one text file to another using cat foo.txt >> bar.txt. The problem is, foo.txt is a Windows text file (I think), because once appended to bar.txt
every new^M
line^M
ends^M
like this.^M
Thus, they are surely two different file-types.
I've searched Google for answers, but Google doesn't accept special characters like ">>" and "^" in searches. The closest solution I can find is one reported on commandlinefu.com, but all it does is strip the \r's from bar.txt, which, really, is no solution at all.
So, I have two questions here:
How do I discover, using bash, what file-type foo.txt really is?
How do convert foo.txt properly to make it appendable to bar.txt?
sed 's/\r$//' foo.txt >> bar.txt
or use dos2unix, if it's available.
Convert the Windows line endings with dos2unix:
dos2unix foo.txt
file foo.txt will output: "ASCII text, with CRLF line terminators" if the file has DOS-style line endings.
Use fromdos and todos, or unix2dos and dos2unix.
Look up the commands dos2unix and unix2dos. You may wish to make a copy first (cp)
A quick-and-dirty solution:
tr -d '\r' < foo.txt

Why is my Bash script adding <feff> to the beginning of files?

I've written a script that cleans up .csv files, removing some bad commas and bad quotes (bad meaning they break an in-house program we use to transform these files) using sed:
# remove all commas, and re-insert the good commas using clean.sed
sed -f clean.sed $1 > $1.1st
# remove all quotes
sed 's/\"//g' $1.1st > $1.tmp
# add the good quotes around good commas
sed 's/\,/\"\,\"/g' $1.tmp > $1.tmp1
# add leading quotes
sed 's/^/\"/' $1.tmp1 > $1.tmp2
# add trailing quotes
sed 's/$/\"/' $1.tmp2 > $1.tmp3
# remove utf characters
sed 's/<feff>//' $1.tmp3 > $1.tmp4
# replace original file with new stripped version and delete .tmp files
cp -rf $1.tmp4 quotes_$1
Here is clean.sed:
s/\",\"/XXX/g;
:a
s/,//g
ta
s/XXX/\",\"/g;
Then it removes the temp files and voilà, we have a new file whose name starts with the word "quotes" that we can use for our other processes.
My question is:
Why do I have to make a sed statement to remove the feff tag in that temp file? The original file doesn't have it, but it always appears in the replacement. At first I thought cp was causing this but if I put in the sed statement to remove before the cp, it isn't there.
Maybe I'm just missing something...
U+FEFF is the code point for a byte order mark. Your files most likely contain data saved in UTF-16 and the BOM has been corrupted by your 'cleaning process' which is most likely expecting ASCII. It's probably not a good idea to remove the BOM, but instead to fix your scripts to not corrupt it in the first place.
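A quick way to confirm what the input actually is before the cleaning starts (a sketch; the exact wording of file output varies between versions):
file "$1"                      # typically mentions UTF-8/UTF-16 and "(with BOM)" when one is present
head -c 4 "$1" | od -An -tx1   # ef bb bf = UTF-8 BOM, ff fe or fe ff = UTF-16 BOM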
To get rid of these in GNU emacs:
Open Emacs
Do a find-file-literally to open the file
Edit off the leading three bytes
Save the file
There is also a way to convert files from the DOS line-termination convention to the Unix one.
