How can I change text file encoding? - node.js

I want to change a file's encoding from UTF-8 to EUC-KR.
Thanks!

Use the iconv tool:
iconv -f utf-8 -t euc-kr file
There is a node.js library at https://github.com/bnoordhuis/node-iconv that might help you without resorting to external tools.
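For example, a minimal sketch using node-iconv (assuming npm install iconv; input.txt and output.txt are placeholder file names):
// Read the raw UTF-8 bytes and write them back out transcoded to EUC-KR.
const fs = require('fs');
const Iconv = require('iconv').Iconv;

const iconv = new Iconv('UTF-8', 'EUC-KR');
const utf8Bytes = fs.readFileSync('input.txt');           // raw UTF-8 buffer
fs.writeFileSync('output.txt', iconv.convert(utf8Bytes)); // EUC-KR buffer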

Related

Check File encoding (UTF-8/ANSI) in AIX Machine

I need to convert files from UTF-8 to ANSI encoding on an AIX machine. iconv works, but I need to check the file's encoding before converting: I am looking for a way to put a condition on top of the iconv command so that the file is converted only if it is in UTF-8 encoding.
iconv -f utf-8 -t windows-1252 abc.dat > abc_new.dat
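One way to add that condition (a hedged sketch: it assumes the file command on your AIX box reports "UTF-8" for such files, which can vary by platform; abc.dat and abc_new.dat are the names from the question):
if file abc.dat | grep -qi 'utf-8'; then
    iconv -f utf-8 -t windows-1252 abc.dat > abc_new.dat
else
    cp abc.dat abc_new.dat
fi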

iconv command is not changing the encoding of a plain text file to another encoding

On Linux I created a plain text file. Using "file -i" I see the file's encoding is "us-ascii". After trying the commands below, the output file's encoding is still shown as "us-ascii". Could you please tell me how to change the encoding? Or is there some way to download an encoded file that I can't read?
iconv -f US-ASCII -t ISO88592//TRANSLIT -o o.txt ip.txt
iconv -f UTF-8 -t ISO-8859-1//TRANSLIT -o op.txt ip.txt
I am expecting that either iconv changes the encoding, or that I can download some encoded file.
If your file contains only ASCII characters, then there is no difference between ASCII, UTF-8 and the various ISO-8859-x encodings: they all represent those characters with the same bytes. So after conversion, you end up with exactly the same file.
A text file does not store any information about which encoding was used. Tools like file therefore apply a few heuristics, but at the end of the day it is just a guess, and since the files are byte-for-byte identical, the guess will always be the same.
To see a difference, you must use characters that are encoded differently in the different encodings, or that are not available at all in one of them, e.g. ă, € or 😊.
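A quick way to see this for yourself (a sketch; file names are placeholders, and it assumes a UTF-8 locale so the literal € is written as UTF-8 bytes):
printf 'hello\n' > ascii.txt
printf 'hello €\n' > euro.txt
file -i ascii.txt     # us-ascii: these bytes are identical in all three encodings
file -i euro.txt      # utf-8: the € forces a multi-byte sequence
iconv -f UTF-8 -t ISO-8859-15 euro.txt | file -i -   # now detected as an 8-bit ISO-8859 encoding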

Convert files between UTF-8 and ISO-8859 on Linux

Every time that I get confronted with Unicode, nothing works. I'm on Linux, and I got these files from Windows:
$ file *
file1: UTF-8 Unicode text
file2: ISO-8859 text
file3: ISO-8859 text
Nothing was working until I found out that the files have different encodings. I want to make my life easy and have them all in the same format:
iconv -f UTF-8 -t ISO-8859 file1 > test
iconv: conversion to `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.
I tried converting to ISO because that would be only one conversion, and when I open those ISO files in gedit, the German letter "ü" is displayed just fine. Okay, next try:
iconv -f ISO-8859 -t UTF-8 file2 > test
iconv: conversion from `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.
but obviously that didn't work.
The ISO-8859-x encodings (ISO-8859-1 is Latin-1) contain only a very limited set of characters; you should always convert to UTF-8 to make life easier.
UTF-8 can represent all of Unicode, which is a superset of every ISO-8859 repertoire, so it is no surprise that not every UTF-8 file can be converted to ISO-8859.
The file command gives only very limited information about the encoding; "ISO-8859" is a family of encodings, not a name iconv recognizes.
You have to guess the source encoding: either ISO-8859-1 or ISO-8859-15, or one of the other variants from 2 to 14, as suggested in the comment by @hobbs.
You can get the list of encodings iconv supports with iconv -l.
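For example, if the ISO files turn out to be ISO-8859-1 (a common guess for German text coming from Windows; adjust if it is really ISO-8859-15 or Windows-1252):
iconv -f ISO-8859-1 -t UTF-8 file2 > file2.utf8
iconv -f ISO-8859-1 -t UTF-8 file3 > file3.utf8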
If guessing the real file encoding is giving you a hard time, a silly brute-force script like the one below might help you out :D
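This is only a sketch: it runs every ISO-8859 variant that iconv knows over the file (file2 is a placeholder) and prints the first few lines of each conversion, so you can eyeball which one looks right.
for enc in $(iconv -l | grep -o 'ISO-8859-[0-9]*' | sort -u); do
    echo "== $enc =="
    iconv -f "$enc" -t UTF-8 file2 2>/dev/null | head -n 3
done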
As in other answers, you can list the supported encodings:
iconv -l | grep 8859
A grep saves you time finding which versions of your encoding are supported. You can supply the number, as in this example (8859), or ISO, or any string you expect in the encoding's name.

How to remove non UTF-8 characters from text file

I have a bunch of Arabic, English, Russian files which are encoded in utf-8. Trying to process these files using a Perl script, I get this error:
Malformed UTF-8 character (fatal)
Manually checking the content of these files, I found some strange characters in them.
Now I'm looking for a way to automatically remove these characters from the files.
Is there anyway to do it?
This command:
iconv -f utf-8 -t utf-8 -c file.txt
will clean up your UTF-8 file, skipping all the invalid characters.
-f is the source format
-t the target format
-c skips any invalid sequence
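Note that you cannot redirect the output onto the input file directly, or the shell will truncate the file before iconv reads it. A safe pattern (file names are placeholders):
iconv -f utf-8 -t utf-8 -c file.txt > file.clean.txt && mv file.clean.txt file.txt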
Whatever method you use must read byte by byte and fully understand the byte-wise construction of UTF-8 characters. The simplest approach is to use an editor that will read anything but only writes out valid UTF-8; TextPad is one choice.
iconv can do it, if you know the real source encoding (cp1252 here is a guess):
iconv -f cp1252 -t utf-8 foo.txt
None of the methods here or on any other similar questions worked for me.
In the end what worked was simply opening the file in Sublime Text 2. Go to File > Reopen with Encoding > UTF-8. Copy the entire content of the file into a new file and save it.
May not be the expected solution but putting this out here in case it helps anyone, since I've been struggling for hours with this.

linux curl save as utf-8

I am trying to use Linux curl to download an XML file from a URL.
I am pretty sure the XML is encoded in UTF-8, but I suspect curl -o doesn't save it as UTF-8.
Is there any way to force curl to save as UTF-8?
Thanks for the suggestions. What I found out:
Because the XML feed is dynamic, it does not always contain multi-byte UTF-8 characters.
Sometimes there is not a single such character in the whole content, even though the XML declaration and the Content-Type header say charset=utf-8. When the content contains at least one multi-byte character, the file is detected as UTF-8.
When it does not, curl's output is not detected as UTF-8, which makes sense: with no multi-byte characters, there is nothing to distinguish the file from plain ASCII.
This is damn tricky, because a validator has to validate the files as UTF-8, so I still need a solution to force UTF-8, since by default all my XML should be UTF-8 encoded.
I tried the suggested iconv -f iso8859-1 -t utf-8, but it doesn't work in this case; I suspect the file isn't ISO-8859-1 either.
Still need a better solution.
curl does not do any conversion of the file it downloads. If the HTTP server serves you the XML in another encoding (e.g., ISO-8859-1), that is how curl will save it to disk too.
To work around your problem, you can use iconv as follows:
curl URL | iconv -f iso8859-1 -t utf-8 > output.xml
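To see which encoding the server actually declares before converting, you can inspect the response headers first (a quick sanity check; URL is a placeholder, and it assumes the server answers HEAD requests):
curl -sI URL | grep -i content-type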
Hope this helps.
Have you tried adding the Accept-Charset header? I had a similar issue with a file that was downloading with the wrong encoding. When I set the Accept-Charset header, it worked:
curl -H "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" URL | iconv -f iso8859-1 -t utf-8 > output.xml
