How can I merge these 3,500 mixed-charset text files? - linux

I have about 3,500 text files, of mixed character sets: ISO-8859, UTF-8, ASCII, UTF-16, and maybe others.
I want to merge them all into one Unicode text file, so I can run a Python script on it that expects Unicode input.
If I use cat, it doesn't exactly work.
What is the best way to solve this?

You could convert them up-front with a tool like iconv, or load them into Python with the correct encoding (by passing the correct encoding to open()).
If you don't know what the encoding of each file is, then it is more complicated, because you'll need to detect the encoding of each file. There are many heuristics, but no absolutely standard way to do this. Again, using iconv can help a lot here.
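If you end up scripting the merge, here is a minimal sketch of the detect-and-merge loop, assuming the third-party chardet package (pip install chardet) as the detector (iconv in a shell loop would do equally well) and with placeholder file paths:

import glob
import chardet

# Guess each file's encoding, decode it, and append everything to one UTF-8 file.
with open("merged.txt", "w", encoding="utf-8") as out:
    for path in glob.glob("input/*.txt"):          # placeholder input pattern
        raw = open(path, "rb").read()
        guess = chardet.detect(raw)                # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
        enc = guess["encoding"] or "latin-1"       # fall back to a decode that never fails
        out.write(raw.decode(enc, errors="replace"))
        out.write("\n")

The UTF-16 files are the easy case here, since their byte order marks give them away; the 8-bit encodings are where the guessing happens.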

Related

Preset encoding for Search-in-Files Feature

I have a huge file dump to handle (7,000+ files), all encoded in OEM-US (and I need them to remain OEM-US, or return to OEM-US, when I'm done).
The search-in-files feature of Notepad++ would actually solve all my problems. (It's a single-use job; I don't want to bore you with the details, but it's about sanitizing old code which has partially been written in foreign languages like German or French, including their notorious characters like äöüèéàç.)
The thing is: most of the time, Notepad++ detects the wrong encodings, and different encodings for different files. Usually it detects ANSI or UTF-8, but sometimes it gets exotic and all of a sudden my files are supposed to be encoded in Shift-JIS or Big5, which messes up my search terms, as they sometimes turn different special characters into the same set of replacement characters.
So I'm looking for a way to either
a) tell Notepad++ which encoding to select for the "search in files" job I want to run,
b) convert all files to UTF-8, run the search-and-replace job there, and restore the encoding to OEM-US afterwards (see the sketch after this question),
or
c) find a different piece of software to handle this issue for me.
Can someone help me?
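For what it's worth, option b) is straightforward to script outside Notepad++. A minimal sketch in Python, assuming OEM-US corresponds to Python's cp437 codec, and with placeholder path pattern and search/replace pairs:

import glob

replacements = {"ä": "ae", "ö": "oe", "ü": "ue"}   # placeholder pairs; fill in the real ones

for path in glob.glob("dump/**/*.txt", recursive=True):   # placeholder pattern
    with open(path, encoding="cp437") as f:               # decode from OEM-US
        text = f.read()
    for old, new in replacements.items():
        text = text.replace(old, new)
    with open(path, "w", encoding="cp437") as f:           # write straight back as OEM-US
        f.write(text)

Since Python works on decoded text in memory, there is no intermediate UTF-8 file to convert back afterwards; the files stay OEM-US on disk.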

python reverse unicode text into readable

I believe I have a similar problem to this: how to convert unicode text to utf8 text readable? But I want a Python 3.7 solution to it.
I am a complete newbie, but I have some experience with Python, so I am trying to use it to make a script that will convert a Unicode file back into the readable text it previously was.
The file is a bookmark file I recovered using EaseUS. When I opened the bookmark file, it is written in "Unicode", something like "&PŽ¾³kÊ
k-7ÄÜÅe–?XBdyÃ8߯r×»Êã¥bÏñ ¥»X§ÈÕÀ¬Zé‚1öÄEdýŽ‹€†.c"
whereas previously is said something like " "checksum": "112d56adbd0caa2b3693bb0442dd16ff",
"roots": {
"bookmark_bar": {
"children":"
FYI: when I click Save As for the Unicode bookmark file, the encoding offered is ANSI and not UTF-8, so maybe it was saved as ANSI. I might be waffling here, but I'm just trying to give you all the information you might need to help me.
I am a newbie who depressingly needs help.
This text isn't "Unicode". It's simply gibberish.
This file has been corrupted -- it may have been overwritten with other data before you were able to recover it. It is unlikely to be recoverable.

How to convert "binary text" to "visible text"?

I have a text file full of non-ASCII characters.
I cannot detect the encoding with either file or enca.
file non_ascii.txt
non_ascii.txt: Non-ISO extended-ASCII text
enca non_ascii.txt
Unrecognized encoding
But I can open it normally in Notepad++ on Windows.
Edit: the statement above is misleading; sorry for that.
In fact, I picked some parts of the original file and put them into a new text file, then opened it in Notepad++.
The two parts are shown below. They are decoded in two different ways by Notepad++.
Question:
How could I detect the file's encoding under Linux?
How do I recover the characters represented by <F1><EE><E9><E4><FF>?
I couldn't get a result from "grep 'сойдя' win.txt" even though "сойдя" is encoded as <F1><EE><E9><E4><FF>.
A slice of the file content follows:
less non_ascii.txt
"non_ascii.txt" may be a binary file. See it anyway?
<F1><EE><E9><E4><FF>
<F2><F0><E0><EA><F2><EE><E2><E0><F2><FC><F1><FF>
<D0><F2><E9><E4><D7><E9><E7><E1><EC><E1><F3><F8>
<D1><E5><EA><F3><ED><E4>
<F0><E0><E7><E3><F0><F3><E7><EA><E8>
<EF><EE><E4><F1><F2><E0><E2><EB><FF><F2><FC>
<F0><E0><E7><E3><F0><F3><E7><EA><E5>
<F1><EE><E9><E4><F3>
<F0><E0><E7><E3><F0><F3><E7><EA><E0>
<F1><EE><E2><EB><E0><E4><E0><EB><E8>
<C1><D7><E9><E1><F0><EF><FE><F4><E1>
<CB><C1><D3><D3><C9><D4><C5><D2><C9><D4>
<F1><EE><E2><EB><E0><E4><E0><EB><EE>
<F1><EE><E9><E4><E8>
<F1><EE><E2><EB><E0><E4><E0><EB><E0>
Your question really has two parts: (1) how do I identify an unknown encoding and (2) how do I convert that to something useful?
The first part is the real challenge, and really cannot be answered in universal terms -- in the general case, there is no reliable way to identify an unknown 8-bit encoding. Some encodings give you good hints (UTF-8 is an excellent example) and in many cases, if you have a good idea what the text is supposed to represent, the problem can be solved.
A mapping of 8-bit character meanings can be helpful (cough, the link is to mine) and in this case quickly hints at Windows code page 1251. Kudos for the hex dumps and the picture with the representation you expect!
With that out of the way, converting is easy.
iconv -f cp1251 -t utf-8 non_ascii.txt >utf8.txt
Provided your Linux system is set up to use UTF-8 at the terminal, your grep command should work on utf8.txt now.
The indication that some of the text is "ANSI" (which is a bogus term anyway) is probably just a red herring -- as far as I can tell, everything in your excerpt looks like well-formed CP1251.
Some tools like chardet do a reasonable job of at least steering you in the right direction, though you have to understand that, like a human expert, they have to guess what the text is supposed to represent. There are corner cases where they just don't have enough information to guess correctly, either because there are several candidate encodings with very few differences (for example, Latin-1 vs Latin-9 vs Windows-1252, all of which also overlap with plain 7-bit US-ASCII in the first 128 positions) or because the input doesn't contain enough information to establish any common patterns.
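If you want to double-check the code-page guess before converting, a tiny sanity check in Python decodes the dumped bytes directly:

# The first dumped sequence, decoded as CP1251, should be the word from the grep.
raw = bytes.fromhex("F1 EE E9 E4 FF")
print(raw.decode("cp1251"))   # prints: сойдя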

delete special characters preceding shebang (M-oM-;M-?#!/bin/bash) [duplicate]

I have a CSV file with special accented characters, and I save it in Notepad by selecting UTF-8 encoding. When I read the file using Java, it reads the BOM characters too.
So I want to save this file in UTF-8 format without appending a BOM initially in Notepad.
Otherwise, is there a built-in class in Java that eliminates the BOM characters present at the beginning when reading the contents of a file?
Use Notepad++ - it is free and much better than Notepad. It will help you save text without a BOM via Encoding → Encode in UTF-8 without BOM (the exact menu wording differs slightly between Notepad++ v6 and older and v7+).
When I encountered this problem in Java, I didn't find any library to parse these first three bytes (BOM). So my advice:
Use PushbackInputStream(in, 3).
Read the first three bytes.
If they are not the BOM (EF BB BF), push them back.
Process the stream as UTF-8.
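The same peek-and-skip idea, sketched in Python rather than Java (an illustration of the approach, not the answer's code); note that Python's built-in utf-8-sig codec will strip the BOM for you:

import codecs

def read_utf8_dropping_bom(path):
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):       # the three bytes EF BB BF
        data = data[len(codecs.BOM_UTF8):]     # drop the BOM; otherwise keep everything
    return data.decode("utf-8")

# Shortcut: open(path, encoding="utf-8-sig") strips a leading BOM automatically.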
I just learned from this Stack Overflow post, as #martin-geisler points out, that you can save files without the BOM in Windows Notepad, by selecting ANSI as the encoding.
I'm assuming this won't work for more advanced uses, because the resulting file is probably not in the encoding you actually wanted, but in ANSI; still, I tested and confirmed that it works to save a very small .php script without a BOM using only Notepad.
I learned the long, hard way that Windows Notepad is not a true editor, although I'd like to point out for others that, despite this, it is what misleadingly comes up when you type "editor" on newer Windows machines, at least on one of mine.
I am currently using Emacs and other editors to solve this problem.
Use Notepad++ instead. See my personal blog post on it. From within Notepad++, choose the "Encoding" menu, then "Encode in UTF-8 without BOM".
Notepad on Windows 10 version 1903 (May 2019 update) and later versions supports saving to UTF-8 without a BOM. In fact, UTF-8 is the default file format now.
Reference: Windows 10 Notepad is Getting Better UTF-8 Encoding Support
The answer is: Not at all. Notepad can't do that.
In Java you can just skip the first three bytes (the BOM) of your InputStream and be done.
You might want to try out Notepad2 or Notepad++. Those Notepad replacements have the option for you to choose whether to output BOM.
As for a Java solution: as far as I know, Java's standard UTF-8 handling does not strip a leading BOM. I googled and found Java's UTF-8 and Unicode writing is broken - Use this fix, which might be the solution.
We're using the utility BOMStripperInputStream.java to strip the BOM from our input if present.

I exported via mysqldump to a file. How do I find out the file encoding of the file?

Given a text file in Ubuntu (or Debian/Unix in general), how do I find out the file encoding of the file? Can I run od or hexdump on it to fingerprint its encoding? What should I be looking out for?
There are many tools to do this. Try a web search for "detect encoding". Here are some of the tools I found:
The International Components for Unicode (ICU) are a great place to start. See especially their page on Character Set Detection.
Chardet is a Python module to guess the encoding of a file; see chardet.feedparser.org. A short usage sketch follows this list.
The *nix command-line tool file detects file types, but might also detect encodings if mentioned in the file (e.g. if there's a mime-type notation in the file). See man file.
Perl modules Encode::Detect and Encode::Guess.
Someone asked a similar question in StackOverflow. Search for the question, PHP: Detect encoding and make everything UTF-8. That's in the context of fetching files from the net and using PHP, but you could write a command-line PHP script.
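For the chardet option above, a short usage sketch (assuming it is installed with pip; the file name and sample size are placeholders):

import chardet

with open("dump.sql", "rb") as f:
    sample = f.read(1_000_000)    # a sizeable sample is usually enough

print(chardet.detect(sample))     # e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}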
Note well what the ICU page says about character set detection: "Character set detection is ..., at best, an imprecise operation using statistics and heuristics...." In my experience the problem domain makes a big difference in how easy or difficult the job is. Don't forget that it's possible that the octets in a file can be of ambiguous encoding, i.e. sensibly interpreted using multiple different encodings. They can also be of mixed encoding, i.e. different subsets of the octets make sense interpreted in different encodings. This is why there's not a single command-line tool I can recommend which always does the job.
If you have a single file and you just want to get it into a known encoding, my trick is to open the file with a text editor which can import using a bunch of different encodings, such as TextWrangler or OpenOffice.org. First, open the file and let the editor guess the encoding. Take a look at the result. If you aren't satisfied with it, guess an encoding, open the file with the editor specifying that encoding, and take a look at the result. Then save as a known encoding, e.g. UTF-16.
You can use enca. Enca is a small command-line tool for encoding detection and conversion.
You can install it on Debian / Ubuntu with:
apt-get install enca
In order to use it, just call
enca FILENAME
Also see the manpage for more information.
