unzip utf8 to ascii - linux

I took some files from my Linux hosting to my Windows machine via FTP, and when I check the file encodings they are UTF-8 without BOM.
Now I need to convert those files back to ASCII and send them to my other Linux server. I have zipped the files.
Can I do something like: while unzipping, if a file is a text file in UTF-8 format, convert it to ASCII? I want to do the conversion while I am unzipping the files.
Thanks.

The program you're looking for is iconv; it will convert between encodings. Use it like this:
iconv -f utf-8 -t ascii < infile > outfile
However, ASCII is a subset of UTF-8: a file that's written in ASCII is also correct UTF-8, so no conversion is needed. The only reason for converting the other way is if there are characters in your UTF-8 file that are outside the ASCII range. And if this is the case, you can't convert it to ASCII, because ASCII doesn't have those characters!
Are you sure you mean ASCII? Pure ASCII is rare these days. ISO-8859-15 (Western European) or CP1252 (Windows) are much more common.
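If you want to do the same check from Python rather than the shell, a minimal sketch like the following (with a placeholder file name) tells you whether a UTF-8 file is already pure ASCII and, if not, which character is in the way:
# A minimal sketch: check whether a UTF-8 text file is already pure ASCII
# and, if not, report the first character that has no ASCII representation.
# "infile.txt" is a placeholder file name.
with open("infile.txt", encoding="utf-8") as f:
    text = f.read()
try:
    text.encode("ascii")
    print("Already pure ASCII; no conversion needed.")
except UnicodeEncodeError as err:
    print(f"Non-ASCII character {text[err.start]!r} at offset {err.start}; "
          f"it cannot be represented in ASCII.")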

Related

Detect encoding in Python3 when BOM is present

For starters, I know that you can't always detect encoding; that's not the object of the question. Can Python detect the encoding when opening a file, if there's a BOM giving the encoding?
I have a collection of files. Some are in UTF-16 (so they have a BOM saying so) and some are in latin 1. Let's ignore the edge case of latin-1 files that for some reason would start with the exact same characters as the UTF-16 BOM. I want, when opening the file, to look for a BOM. If there is one, automatically open the file using the encoding associated with the BOM. If there isn't a BOM, open with latin 1.
Here is my workaround:
with open(filename, mode="rb") as f:
    text = f.readlines()
text = b''.join(text)
if text[:2] == b'\xff\xfe':
    text = text.decode("utf-16")
else:
    text = text.decode("iso-8859-1")
Is there some kind of module that replaces open to do that, handling all encodings that have a BOM?
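One way to approximate this with just the standard library is to compare the file's first bytes against the BOM constants in the codecs module. A hedged sketch follows; the helper name open_with_bom is my own, not from any library:
import codecs
# A hedged sketch, not a drop-in replacement for open(): compare the file's
# first bytes with the BOM constants from the standard codecs module and
# fall back to latin-1 when no BOM is found. The UTF-32 BOMs must be tested
# before UTF-16, because the UTF-32-LE BOM starts with the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]
def open_with_bom(filename, fallback="iso-8859-1"):
    with open(filename, "rb") as f:
        head = f.read(4)                  # the longest BOM is 4 bytes
    for bom, encoding in BOMS:
        if head.startswith(bom):
            # "utf-8-sig", "utf-16" and "utf-32" all strip the BOM themselves
            return open(filename, encoding=encoding)
    return open(filename, encoding=fallback)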

Confirming the encoding of a file

I am outputting a file from SSIS in UTF-8 Encoding.
This file is passed to a third party for import into their system.
They are having a problem importing this file. Although they requested UTF-8 encoding, it seems they convert the encoding to ISO-8859-1. They use this command to convert the file's encoding:
iconv -f UTF-8 -t ISO-8859-1 dweyr.inp
They are receiving this error
illegal input sequence at position 11
The piece of text causing the issue is:
ark O’Dwy
I think it's the apostrophe, or whatever version of an apostrophe is used in this text.
The problem I face is that every text editor I try tells me the file is UTF-8 and renders it correctly.
The vendor is saying that this character is not UTF-8.
How can I confirm who is correct?
The error message from iconv is a bit misleading, but kind of correct.
It doesn't tell you that the input isn't valid UTF-8, but that it cannot be converted to ISO-8859-1 in a lossless way. ISO-8859-1 does not have a way to encode the ’ character.
Verify that by executing this command:
echo "ark O’Dwy" | iconv -f UTF-8 -t UTF-7
This produces output that looks like "ark O+IBk-Dwy".
Here I'm outputting to UTF-7 (a very rarely used encoding that is useful for demonstration here, but little else).
In other words: the encoding is only "illegal" in the sense that it cannot be converted to ISO-8859-1, but it's a perfectly valid UTF-8 sequence.
If the third party claims to support UTF-8, then they may do so only very superficially. They might support any text that can be encoded in ISO-8859-1 as long as it's encoded in UTF-8 (which is an extremely low level of "UTF-8 support").
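To see for yourself which side is right, a couple of lines of Python reproduce the situation; this is only an illustration, not part of either toolchain:
# The bytes E2 80 99 are the UTF-8 encoding of U+2019 (right single quotation
# mark), the curly apostrophe in "O’Dwy". Decoding them as UTF-8 works, but
# ISO-8859-1 has no code point for that character.
text = b"ark O\xe2\x80\x99Dwy".decode("utf-8")   # valid UTF-8
print(text)                                      # ark O’Dwy
try:
    text.encode("iso-8859-1")
except UnicodeEncodeError as err:
    print("Not representable in ISO-8859-1:", err)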

A UTF-8 encoded file produces UnicodeDecodeError during parsing

I'm trying to reformat a text file so I can upload it to a pipeline (QIIME2) - I tested the first few lines of my .txt file (but it is tab separated), and the conversion was successful. However, when I try to run the script on the whole file, I encounter an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 16: invalid start byte
I have identified that the file encoding is UTF-8, so I am not sure where the problem is coming from.
$ file filename.txt
filename: UTF-8 Unicode text, with very long lines, with CRLF line terminator
I have also reviewed some of the lines that are associated with the error, and I am not able to visually identify any unorthodox characters.
I have tried to force encode it using:
$ iconv -f UTF8 -t UTF8 filename.txt > new_file.txt
However, the error produced is:
iconv: illegal input sequence at position 152683
My understanding is that whatever character occurs at that position is not readable/translatable using UTF-8 encoding, but then I am not sure why the file is reported as UTF-8 encoded.
I am running this on Linux, and the data itself is sequence information from the BOLD database (in case anyone else has run into similar problems when trying to convert such data into a format appropriate for QIIME2).
file is wrong. The file command doesn't read the entire file; it bases its guess on a sample from the beginning, which is why it is so fast even on huge files.
I'm guessing that your file is actually UTF-8 in the beginning, because UTF-8 has characteristic byte sequences. It's quite unlikely that a piece of text only looks like UTF-8 but isn't actually.
But the part of the text containing the byte 0x96 cannot be UTF-8. It's likely that some text was encoded with an 8-bit encoding like CP1252, and then concatenated to the UTF-8 text. This is something that shouldn't happen, because now you have multiple encodings in a single file. Such a file is broken with respect to text encoding.
This is all just guessing, but in my experience, this is the most likely explanation for the scenario you described.
For text with broken encoding, you can use the third-party Python library ftfy ("fixes text for you").
It will cut your text at every newline character and try to find (guess) the right encoding for each portion.
It doesn't magically do the right thing always, but it's pretty good.
To give you more detailed guidance, you'd have to show the code of the script you're calling (if it's your code and you want to fix it).
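In the meantime, if you want to locate the offending bytes yourself, a small standard-library sketch like this reports every line that is not valid UTF-8 (using the file name from the question):
# A diagnostic sketch: report every line that is not valid UTF-8, together
# with the first offending byte, so the mixed-encoding parts can be inspected.
with open("filename.txt", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            print(f"line {lineno}: invalid byte 0x{raw[err.start]:02x} "
                  f"at column {err.start}")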

Removing lines containing encoding errors in a text file

I must warn you I'm a beginner. I have a text file in which some lines contain encoding errors. By "error", I mean that when I view the file in my Linux console I get question marks instead of characters.
I want to remove every line showing those "question marks". I tried to grep -v the problematic character, but it doesn't work. The file itself is UTF8 and I guess some of the lines come from texts encoded in another format. I know I could find a way to reconvert them properly, but I just want them gone for now.
Do you have any ideas about how I could do this please?
PS: Some lines contain diacritics which are displayed fine. The "strings" command seems to remove too many "good" lines.
When dealing with mojibake on character encodings other than ANSI you must check 2 things:
1. Is the file really encoded in X? (X being UTF-8 WITHOUT BOM in your case. You could be trying to read UTF-8 WITH BOM, UTF-16, latin-1, etc. as UTF-8, and that would be the problem.) Try reading it in (not converting it to) other encodings and see if any of them fits.
2. Is your locale or text editor set to read the file as UTF-8? If not, that may be the problem. Check for support and figure out how to change the setting. On Linux, try the locale and setlocale commands to check and set it properly.
I like how Notepad++ for Windows (which also runs perfectly on Linux using Wine) lets you set any encoding you want to read the file without trying to convert it (of course, if you set any encoding other than the one the file is encoded in, you will only see those weird characters), and it also has a separate option to convert from one encoding to another. That has been pretty useful to me.
If you are a beginner you may be interested in this article. It explains briefly and clearly the whats, whys and hows of character encoding.
[EDIT] If the above fails, even with windows-1252 and similar ANSI encodings, I've just learned how to remove non-ASCII characters using the Unix tr command, turning the file into plain ASCII (but be aware that the information in those extra characters is lost in this output and there is no coming back, so keep the input file in case you find a better fix):
tr -cd '\11\12\40-\176' < $INPUT_FILE > $OUTPUT_FILE
or, if you want to get rid of the whole line:
grep -v -P "[^\11\12\40-\176]" $INPUT_FILE > $OUTPUT_FILE
[EDIT 2] This answer gives a pretty good guess of what could be happening if none of the encodings work on your file (unfortunately, the only straightforward solution seems to be removing those problematic characters).
You can use a micro-Perl script like:
perl -pe 's/[^[:ascii:]]+//g;' my_utf8_file.txt
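If you prefer Python for this, here is a rough sketch that keeps only the lines that decode cleanly as UTF-8; unlike the ASCII-only grep above, it preserves lines with correctly encoded diacritics (file names are placeholders):
# A rough sketch: copy only the lines that are valid UTF-8 and drop the rest.
# Unlike the ASCII-only grep above, this keeps correctly encoded diacritics.
# File names are placeholders.
with open("input.txt", "rb") as src, open("output.txt", "wb") as dst:
    for raw in src:
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            continue    # skip lines containing bytes that are not valid UTF-8
        dst.write(raw)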

How to substitute cp1250 specific characters to utf-8 in Vim

I have some central european characters in cp1250 encoding in Vim. When I change encoding with set encoding=utf-8 they appear like <d0> and such. How can I substitute over the entire file those characters for what they should be, i.e. Đ, in this case?
As sidyll said, you should really use iconv for this purpose. Iconv knows stuff. It knows all the hairy encodings, obscure code points, katakana, denormalized and canonical forms, compositions, nonspacing characters and the rest.
:%!iconv --from-code cp1250 --to-code utf-8
or shorter
:%!iconv -f cp1250 -t utf-8
to filter the whole buffer. If you do
:he xxd
you'll get a sample of how to automatically convert on buffer load/save, if you want that.
iconv -l will list all the encodings it accepts/knows about (many: 1168 on my system).
Happy hacking!
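Outside Vim, the same conversion can also be scripted; as a hedged illustration, here is a Python equivalent of the iconv filter above (file names are placeholders):
# Equivalent in spirit to `iconv -f cp1250 -t utf-8`: read the file as cp1250
# and write it back out as UTF-8. File names are placeholders.
with open("input.txt", "r", encoding="cp1250") as src:
    text = src.read()
with open("output.txt", "w", encoding="utf-8") as dst:
    dst.write(text)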
The iconv() function may be useful:
iconv({expr}, {from}, {to}) *iconv()*
The result is a String, which is the text {expr} converted
from encoding {from} to encoding {to}.
When the conversion fails an empty string is returned.
The encoding names are whatever the iconv() library function
can accept, see ":!man 3 iconv".
Most conversions require Vim to be compiled with the |+iconv|
feature. Otherwise only UTF-8 to latin1 conversion and back
can be done.
This can be used to display messages with special characters,
no matter what 'encoding' is set to. Write the message in
UTF-8 and use:
echo iconv(utf8_str, "utf-8", &enc)
Note that Vim uses UTF-8 for all Unicode encodings, conversion
from/to UCS-2 is automatically changed to use UTF-8. You
cannot use UCS-2 in a string anyway, because of the NUL bytes.
{only available when compiled with the +multi_byte feature}
You can set 'encoding' to the value of your file's encoding and 'termencoding' to UTF-8. See the Vim mbyte documentation.
