Check file encoding (UTF-8/ANSI) on an AIX machine

I need to convert a file from UTF-8 to ANSI on an AIX machine. iconv works, but I need to check the file's encoding before converting. I am looking for a condition I can put in front of the iconv command so that the file is converted only if it is in UTF-8.
iconv -f utf-8 -t windows-1252 abc.dat > abc_new.dat
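One possible way to add such a condition, sketched here in Python rather than as a shell test (assuming Python 3 is available on the machine, which the question does not state). The idea is that a file is treated as UTF-8 only if all of its bytes decode as well-formed UTF-8. The helper names is_utf8 and convert_if_utf8 are made up for this illustration, and cp1252 is Python's codec name for windows-1252.

```python
import os
import tempfile


def is_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_if_utf8(src: str, dst: str, target: str = "cp1252") -> bool:
    """Write src to dst in the target encoding only when src is valid UTF-8.

    Returns True if a converted file was written, False if src was skipped.
    """
    with open(src, "rb") as f:
        data = f.read()
    if not is_utf8(data):
        return False
    # Strict encoding (the default) raises UnicodeEncodeError if a character
    # has no equivalent in the target encoding.
    with open(dst, "wb") as f:
        f.write(data.decode("utf-8").encode(target))
    return True


# Demo on throwaway files: "é" is 0xC3 0xA9 in UTF-8 and 0xE9 in cp1252.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "abc.dat")
    dst = os.path.join(tmp, "abc_new.dat")
    with open(src, "wb") as f:
        f.write("café\n".encode("utf-8"))
    converted = convert_if_utf8(src, dst)
    with open(dst, "rb") as f:
        result = f.read()

print(converted, result)
```

Two caveats: an ASCII-only file also passes the check (and converts to identical bytes), and a file in another encoding can occasionally happen to be valid UTF-8, so the check is a strong hint rather than proof.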

Related

iconv command is not changing the encoding of a plain text file to another encoding

On Linux I created a plain text file. Using "file -i", I see that its encoding is "us-ascii". After trying the commands below, the output file's encoding is still shown as "us-ascii". Could you please tell me how to change the encoding? Or is there a way to download an encoded file that I can't read?
iconv -f US-ASCII -t ISO88592//TRANSLIT -o o.txt ip.txt
iconv -f UTF-8 -t ISO-8859-1//TRANSLIT -o op.txt ip.txt
I expect either that iconv changes the encoding or that I can download some encoded file.
If your file contains only ASCII characters, then there is no difference between the ASCII, UTF-8 and the various ISO 8859-x encodings. So after conversion, you will end up with exactly the same file.
A text file does not store any information about what encoding was used. Therefore, the file command applies a few heuristics, but at the end of the day it is just a guess. And as the files are identical, the result will always be the same.
To see a difference, you must use characters that are encoded differently in the different encodings, or that are not available at all in one of them, e.g. ă, € or 😊.
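The point can be checked with a few lines of Python. This is only an illustrative sketch using Python's built-in codecs, not iconv itself, and the variable names are made up for the example.

```python
text = "plain ASCII text"

# ASCII-only text is byte-for-byte identical in all of these encodings.
ascii_same = (
    text.encode("ascii")
    == text.encode("utf-8")
    == text.encode("iso-8859-1")
    == text.encode("iso-8859-2")
)

# "ă" exists in both ISO-8859-2 and UTF-8, but with different bytes.
a_breve_utf8 = "ă".encode("utf-8")         # two bytes
a_breve_latin2 = "ă".encode("iso-8859-2")  # one byte

# "€" has no slot in ISO-8859-1, so strict encoding fails.
try:
    "€".encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

print(ascii_same, a_breve_utf8, a_breve_latin2, euro_in_latin1)
```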

Convert files between UTF-8 and ISO-8859 on Linux

Every time that I get confronted with Unicode, nothing works. I'm on Linux, and I got these files from Windows:
$ file *
file1: UTF-8 Unicode text
file2: ISO-8859 text
file3: ISO-8859 text
Nothing was working until I found out that the files have different encodings. I want to make my life easy and have them all in the same format:
iconv -f UTF-8 -t ISO-8859 file1 > test
iconv: conversion to `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.
I tried to convert to ISO because that's only 1 conversion + when I open those ISO files in gedit, the German letter "ü" is displayed just fine. Okay, next try:
iconv -f ISO-8859 -t UTF-8 file2 > test
iconv: conversion from `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.
but obviously that didn't work.
The ISO-8859-x encodings (ISO-8859-1 is Latin-1) each cover only a very limited set of characters, so you should generally convert to UTF-8 to make life easier.
UTF-8 can encode all of Unicode, which is a superset of every ISO 8859 part, so it is not surprising that a UTF-8 file may not convert cleanly to ISO 8859. The errors above, however, come from the name: iconv does not accept the bare ISO-8859 and needs a specific part such as ISO-8859-1.
The file command seems to give only very limited information about the file's encoding.
You could try guessing the source encoding: ISO-8859-1, ISO-8859-15, or one of the other parts (2 to 14), as suggested in the comment by @hobbs.
You can list the encodings that iconv supports with iconv -l.
If guessing the real file encoding proves hard, this silly script might help you out :D
As in other answers, you can list out the supported formats
iconv -l | grep 8859
Piping the list through grep saves time when looking for which variants of your encoding are supported. You can grep for the number as in my example, for ISO, or for any string you expect in the encoding name.
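To illustrate why the specific ISO 8859 part matters, here is a small Python sketch of the same conversion. It uses Python's codecs rather than iconv, the helper name to_utf8 is hypothetical, and it assumes the ISO files from the question are ISO-8859-1 or ISO-8859-15, which the question does not confirm.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a named source encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# "ü" is the single byte 0xFC in ISO-8859-1 and becomes two bytes in UTF-8.
latin1_bytes = "für".encode("iso-8859-1")
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")

# The parts differ in a few places: byte 0xA4 is "¤" in part 1 but "€" in part 15.
a4_part1 = b"\xa4".decode("iso-8859-1")
a4_part15 = b"\xa4".decode("iso-8859-15")

print(latin1_bytes, utf8_bytes, a4_part1, a4_part15)
```

Because the parts mostly overlap, a wrong guess often looks right until a character such as the euro sign shows up, so it is worth checking a few non-ASCII characters after converting.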

How to keep an iconv error from halting my program

I have a UTF-8 document that I want to convert to Big5 encoding using iconv with the command below:
iconv -f utf-8 -t big5 $inputFile -o $outputFile
However, some UTF-8 characters in it are incomplete: I limit each line of the document to a fixed number of bytes (for example 40), so some multi-byte characters get cut off.
When iconv reaches one of these incomplete sequences, it cannot find a corresponding Big5 encoding, reports an error, and stops.
Is there any way to keep iconv from halting, so that it skips the incomplete UTF-8 sequences and continues converting the rest of the document to Big5?
I'm not sure that is what you are looking for, but, to quote man iconv:
DESCRIPTION
The iconv program converts the encoding of characters in inputfile, or
from the standard input if no filename is specified, from one coded
character set to another.
OPTIONS
-c Omit invalid characters from output.
[...]
The man page is not very clear, but when you use that option, characters in the source file that are invalid in the source encoding are discarded.
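For comparison, the same skip-and-continue behaviour can be sketched in Python with the errors="ignore" codec handler. This is an analogy to iconv -c, not a description of how iconv implements it, and the function name convert_skipping_invalid is made up for the example.

```python
def convert_skipping_invalid(
    data: bytes, source: str = "utf-8", target: str = "big5"
) -> bytes:
    """Convert between encodings, silently dropping anything unconvertible."""
    # Drop input bytes that are malformed in the source encoding.
    text = data.decode(source, errors="ignore")
    # Drop characters that the target encoding cannot represent.
    return text.encode(target, errors="ignore")


full = "中文".encode("utf-8")  # 6 bytes, 3 per character
cut_line = full[:5] + b"\n"    # the second character loses its last byte
converted = convert_skipping_invalid(cut_line)

# Only the intact first character and the newline survive.
print(converted == "中\n".encode("big5"))
```

Dropping bytes silently loses data, so where possible it is safer to cut lines on character boundaries before converting.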

How can I change text file encoding?

I want to change file encoding utf-8 -> euc-kr.
Thanks!
Use the iconv tool:
iconv -f utf-8 -t euc-kr file
There is a node.js library at https://github.com/bnoordhuis/node-iconv that might help you without resorting to external tools.

encoding problem?

I work with TXT files, and I recently found characters like these in a few of them:
http://pastebin.com/raw.php?i=Bdj6J3f4
What could these characters be? A wrong character encoding? I just want to use normal UTF-8 TXT files, but when I use:
iconv -t UTF-8 input.txt > output.txt
it's still the same.
When I open the files in gedit and copy and paste their contents into another TXT file, the characters from the pastebin are gone. So gedit can solve this problem, since it decodes the TXT files correctly, but there are too many files to do this by hand.
Why are there characters like those in http://pastebin.com/raw.php?i=Bdj6J3f4 in the text files? Can they be converted to "normal" characters? I can't see, for example, the "Ì" character when I open the files with vim, only after I process them (with awk, etc.).
It would help if you posted the actual binary content of your file (for example, the output of od -t x1). The pastebin returns this as HTML:
"ÃŒ"
"Ã"
"Ã©"
The first line corresponds to U+00C3 U+0152. The last line corresponds to U+00C3 U+00A9, which is the string "\u00e9" (é) encoded in UTF-8 ("\xc3\xa9"), with the UTF-8 bytes reinterpreted as Latin-1.
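The reinterpretation described above can be reproduced and reversed in Python. This is a sketch with made-up variable names. It also assumes the first line was read as Windows-1252 rather than strict Latin-1, since U+0152 exists only in the former.

```python
original = "é"
utf8_bytes = original.encode("utf-8")    # the two bytes 0xC3 0xA9
misread = utf8_bytes.decode("latin-1")   # U+00C3 U+00A9, shown as "Ã©"

# Reversing the mistake: back to the raw bytes, then decode them as UTF-8.
repaired = misread.encode("latin-1").decode("utf-8")

# U+0152 is not in Latin-1; it is how Windows-1252 reads byte 0x8C.
first_line = "\u00c3\u0152"
first_bytes = first_line.encode("cp1252")     # the two bytes 0xC3 0x8C
first_repaired = first_bytes.decode("utf-8")  # U+00CC, "Ì"

print(repr(misread), repr(repaired), first_bytes, repr(first_repaired))
```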
From man iconv:
The iconv program converts text from
one encoding to another encoding. More
precisely, it converts from the
encoding given for the -f option to
the encoding given for the -t option.
Either of these encodings defaults to
the encoding of the current locale
Because you didn't specify the -f option, iconv assumes the file is encoded in your current locale's encoding (probably UTF-8), which apparently is not the case. Your text editors (gedit, vim) do some encoding detection. You can check which encoding they detect (I don't know how, as I don't use either of them) and pass it to iconv with -f, or save the open file in your desired encoding from one of those editors.
You can also use an encoding detection tool, such as Python's chardet module:
$ python3 -c "import chardet as c; print(c.detect(open('file.txt', 'rb').read(4096)))"
{'confidence': 0.7331842298102511, 'encoding': 'ISO-8859-2'}
Solved!
How:
I just right-clicked the folders containing the TXT files and pasted them into another folder, and presto, there are no more ugly characters.
