Linux: iconv ASCII text to UTF-8

I have a file which contains the letter ö. Except that it doesn't. When I open the file in gedit, I see:
\u00f6
I tried to convert the file, applying code that I found on other threads:
$ file blb.txt
blb.txt: ASCII text
$ iconv -f ISO-8859-15 -t UTF-8 blb.txt > blb_tmp.txt
$ file blb_tmp.txt
blb_tmp.txt: ASCII text
What am I missing?
EDIT
I found this solution:
echo -e "$(cat blb.txt)" > blb_tmp.txt
$ file blb_tmp.txt
blb_tmp.txt: UTF-8 Unicode text
The -e "enables interpretation of backslash escapes".
Still not sure why iconv didn't make it happen. I'm guessing it's something like "iconv only changes the encoding, it doesn't interpret". Not sure yet what the difference is, though. Why did the Unicode people make this world such a mess? :D
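For what it's worth, the guess in the edit is right: iconv only transcodes bytes from one character encoding to another; it never interprets escape sequences. Since blb.txt literally contains the six ASCII characters \u00f6, it is already valid ASCII (and therefore valid UTF-8), so iconv has nothing to change. A minimal sketch of the difference, assuming a reasonably recent bash in a UTF-8 locale (its printf '%b' expands the same escapes as echo -e):
$ xxd blb.txt                                     # shows the raw bytes 5c 75 30 30 66 36, i.e. the text "\u00f6"
$ printf '%b\n' "$(cat blb.txt)" > blb_tmp.txt    # expand the \u escape into a real ö
$ file blb_tmp.txt
blb_tmp.txt: UTF-8 Unicode text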

Related

How can I remove the BOM from a UTF-8 file?

I have a file in UTF-8 encoding with BOM and want to remove the BOM. Are there any linux command-line tools to remove the BOM from the file?
$ file test.xml
test.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines
Using VIM
Open file in VIM:
vi test.xml
Remove BOM encoding:
:set nobomb
Save and quit:
:wq
For a non-interactive solution, try the following command line:
vi -c ":set nobomb" -c ":wq" text.xml
That should remove the BOM, save the file and quit, all from the command line.
A BOM is the Unicode codepoint U+FEFF; its UTF-8 encoding consists of the three bytes 0xEF, 0xBB, 0xBF.
With bash, you can create a UTF-8 BOM with the $'' special quoting form, which implements Unicode escapes: $'\uFEFF'. So with bash, a reliable way of removing a UTF-8 BOM from the beginning of a text file would be:
sed -i $'1s/^\uFEFF//' file.txt
This will leave the file unchanged if it does not start with a UTF-8 BOM, and otherwise remove the BOM.
If you are using some other shell, you might find that "$(printf '\ufeff')" produces the BOM character (that works with zsh, as well as with any shell without a printf builtin provided that /usr/bin/printf is the GNU version), but if you want a POSIX-compatible version you could use:
sed "$(printf '1s/^\357\273\277//')" file.txt
(The -i in-place edit flag is also a GNU extension; this version writes the possibly modified file to stdout.)
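If you need the in-place effect without GNU sed, one option is to write to a temporary file and move it back; a rough sketch (not atomic, and the temporary file name is just an example):
sed "$(printf '1s/^\357\273\277//')" file.txt > file.txt.tmp && mv file.txt.tmp file.txt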
Well, just dealt with this today and my preferred way was dos2unix:
dos2unix will remove the BOM and also take care of other idiosyncrasies from other operating systems:
$ sudo apt install dos2unix
$ dos2unix test.xml
It's also possible to remove BOM only (-r, --remove-bom):
$ dos2unix -r test.xml
Note: tested with dos2unix 7.3.4
If you are certain that a given file starts with a BOM, then it is possible to remove the BOM from a file with the tail command:
tail --bytes=+4 withBOM.txt > withoutBOM.txt
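Note that tail blindly drops the first three bytes whether or not they are a BOM, so if you are not sure, you can guard it first; a rough sketch assuming GNU tail and xxd are available:
# strip the first three bytes only if they really are the UTF-8 BOM (ef bb bf)
if [ "$(head -c 3 withBOM.txt | xxd -p)" = "efbbbf" ]; then
    tail --bytes=+4 withBOM.txt > withoutBOM.txt
else
    cp withBOM.txt withoutBOM.txt
fi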
Joshua Pinter's answer works correctly on Mac, so I wrote a script that removes the BOM from all files in a given folder, see here.
It can be used as follows:
Remove BOM from all files in current directory: rmbom .
Print all files with a BOM in the current directory: rmbom . -a
Only remove BOM from all files in current directory with extension txt or cs: rmbom . -e txt -e cs
If you want to work on a whole batch of files, building on Reginaldo Santos's answer, there is a quick way:
find . -name "*.java" | grep java$ | xargs -n 1 dos2unix
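Since find already filters on the name, the grep is not strictly needed; a slightly leaner variant (assuming a dos2unix that accepts several file names at once, which current versions do):
find . -name "*.java" -exec dos2unix {} +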

How to replace a string containing "\u2015"?

Does anyone know how to replace a string containing \u2015 in a SED command like the example below?
sed -ie "s/some text \u2015 some more text/new text/" inputFileName
You just need to escape the backslash. The example below works fine in GNU sed version 4.2.1:
$ echo "some text \u2015 some more text" | sed -e "s/some text \\\u2015 some more text/abc/"
abc
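As a side note (not part of the original answer): with single quotes the shell leaves the backslashes alone, so only the escaping sed itself needs remains. Assuming bash, whose echo does not expand \u without -e:
$ echo 'some text \u2015 some more text' | sed 's/some text \\u2015 some more text/abc/'
abc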
Also, you don't have to use the -i flag, which according to the man page is only for editing files in place.
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if extension supplied). The default operation mode is to break symbolic and hard links. This can be changed with --follow-symlinks and
--copy.
Not sure if this is exactly what you need, but maybe you should take a look at the native2ascii tool to convert such Unicode escapes.
Normally it replaces all characters that cannot be displayed in ISO-8859-1 with their Unicode escapes (\uXXXX), but it also supports reverse conversions. Assuming you have some file in UTF-8 named "input" containing \u00abSome \u2015 string\u00bb, then executing
native2ascii -encoding UTF-8 -reverse input output
will result in "output" file with «Some ― string».

Linux script to automatically convert file type to UTF8

I am in a tight spot and could use some help coming up with a Linux shell script to convert a directory full of pipe-delimited files from their original file encoding to UTF-8. The source files are either US-ASCII or ISO-8859-1 encoded. The closest thing that I could come up with is:
iconv -f ISO8859-1 -t utf-8 * > name_of_utf8_file
This condenses all of the files into a single file, which is not needed but OK for this application. The problem is that I need to specify both the source and destination encoding, so for half of the files I don't know what it does. Is there a way to write a shell script using commands like file -i or the like?
Any advice here is much appreciated.
This is (not properly tested, caveat emptor :)) one way of doing it:
Maybe try it with a small subset first; this is more of a thought example than a turn-key solution.
for i in *
do
    # file -i reports the MIME type and charset, e.g. "text/plain; charset=us-ascii"
    if file -i "${i}" | grep -q us-ascii; then
        iconv -f us-ascii -t utf-8 "$i" > "${i}.utf8"
    elif file -i "${i}" | grep -q iso-8859-1; then
        iconv -f iso8859-1 -t utf-8 "$i" > "${i}.utf8"
    fi
done
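If you would rather not hard-code the two encodings, a more generic sketch (same caveat about testing; it also assumes the charset names printed by file are ones iconv understands, which holds at least for us-ascii and iso-8859-1):
for i in *
do
    enc=$(file -bi "$i" | sed 's/.*charset=//')   # e.g. us-ascii, iso-8859-1, utf-8
    case "$enc" in
        utf-8|binary) ;;                          # already UTF-8, or not text at all: skip
        *) iconv -f "$enc" -t utf-8 "$i" > "${i}.utf8" ;;
    esac
done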

Finding the encoding of text files

I have a bunch of text files with different encodings. But I want to convert all of them into UTF-8. Since there are about 1000 files, I can't do it manually. I know that there are some commands in Linux which change the encoding of a file from one encoding into another one. But my question is how to automatically detect the current encoding of a file. Clearly I'm looking for a command (say FindEncoding($File)) to do this:
foreach file
do
$encoding=FindEncoding($File);
uconv -f $encoding -t utf-8 $file;
done
I usually do something like this:
for f in *.txt; do
    # extract the charset reported by file -i, e.g. "iso-8859-1"
    encoding=$(file -i "$f" | sed "s/.*charset=\(.*\)$/\1/")
    recode $encoding..utf-8 "$f"
done
Note that recode will overwrite the file for changing the character encoding.
If it is not possible to identify the text files by extension, their respective MIME type can be determined with file -bi "$f" | cut -d ';' -f 1.
It is also probably a good idea to avoid unnecessary re-encodings by checking for UTF-8 first:
if [ ! "$encoding" = "utf-8" ]; then
#encode
After this treatment, there might still be some files with a us-ascii encoding. The reason is that ASCII is a subset of UTF-8: a file containing only ASCII characters is reported as us-ascii, and file only reports UTF-8 once a character outside the ASCII range shows up.
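Putting the pieces together, a rough sketch of the whole loop (assuming file and recode are installed; us-ascii files are skipped as well, since plain ASCII is already valid UTF-8):
for f in *.txt; do
    encoding=$(file -bi "$f" | sed 's/.*charset=//')
    if [ "$encoding" != "utf-8" ] && [ "$encoding" != "us-ascii" ]; then
        recode "$encoding..utf-8" "$f"   # recode rewrites the file in place
    fi
done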

Check if file contains multibyte character

I have some subtitle files in UTF-8. Sometimes there are sporadic multibyte characters in these files which cause problems in some applications.
How do I check in Linux (and possibly locate these) if a certain file contains any multibyte character?
You can use the file command:
chalet16$ echo test > a.txt
chalet16$ echo testก > b.txt #One of Thai characters
chalet16$ file *.txt
a.txt: ASCII text
b.txt: UTF-8 Unicode text
You can use the file or chardet command.
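If you also want to locate the offending characters rather than just detect them, a quick sketch using GNU grep's -P (PCRE) mode; the subtitle file name is only a placeholder:
# print the number and content of every line containing a character outside the ASCII range
grep -n -P '[^\x00-\x7F]' subtitles.srt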
