Quick text-processing question. It's not necessarily related to programming, but this seemed like the best place to ask.
Rate down to tell me this kind of question is not welcome here. (Though, I really like my one little reputation point.)
Anyway, how can I encode text so that two characters get rendered in the same character space?
NOTE: this is for plain-text -- nothing particularly complex.
The best you can do is put a backspace character between the two. However, the outcome isn't likely to be useful to you; it will depend on what software is being used to display the text. The most likely result is that the backspace will be ignored or shown as some generic "unavailable" glyph. The second most likely is that the second character will completely erase the first. You'd have to be very lucky for the two characters to be displayed one over the other in the same space.
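For what it's worth, here's a minimal Python sketch of what that looks like; how the result renders is entirely up to the terminal, editor, or printer:

    # Emit "a", a backspace (0x08), then "b". Most modern terminals and
    # editors will just show "b" or a placeholder glyph; historically,
    # printers and teletypes overstruck the two characters.
    overstruck = "a\bb"
    print(repr(overstruck))  # 'a\x08b' -- the raw characters
    print(overstruck)        # rendering is device-dependent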
If it's plain text to be processed by any editor, as far as I know you can't. Even if your text is encoded in Unicode, I don't think it provides combining characters for normal letters, only for accents and similar marks that are intended to be combined with other glyphs.
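To illustrate the accent case: a combining mark does share the cell with the letter before it, but that only works for marks, not for two ordinary letters. A quick Python check:

    import unicodedata

    base = "e"
    acute = "\u0301"            # COMBINING ACUTE ACCENT
    combined = base + acute     # two code points, one display cell: "é"
    print(combined, len(combined))                 # é 2
    print(unicodedata.combining(acute))            # non-zero => combining mark
    print(unicodedata.normalize("NFC", combined))  # collapses to the single code point U+00E9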
BTW, I'm not sure that stackoverflow is the right place for this kind of stuff; it would fit better on superuser.com.
I often come across the situation where I would like to read a file's original content in a human-readable way. When opening this kind of file in a text editor, why is it usually gibberish with some complete and comprehensible text mixed in? I would think that if the file is converted to something other than its original written format, there would be no comprehensible text remaining, yet I often find it is somewhere in between.
For example, I know that if I open a binary in a text format, there will be nothing comprehensible left that isn't purely accidental.
Example screencapture of partial gibberish text
Why is there complete text in here mixed with gibberish? Does that mean if I open the file with some sort of different encoding (I don't know what's possible), the file will come through as fully readable text? I would understand if it were all-or-nothing (either gibberish-non-readable OR human language) but I don't understand the in-between.
Please provide educational responses, rather than "because that's the way it is" type answers.
Those are formatting characters; they have no standard use and vary by the format of the file in question. You can still extract the text as needed with a fair knowledge of grep and regex, but it won't be fun. The best bet is to open the file with the software that can read it properly, since a text editor like gedit or Notepad++ will read the raw data and display that. Adobe's pdf format has text embedded, for instance, and all that gibberish is instructions for the Reader software for displaying it correctly on the screen while still allowing for relatively straightforward text extraction when required.
Editors have no real way to interpret the special formatting characters, and would need to be loaded with APIs for every conceivable program. They would also need to be updated constantly, since the formatting changes regularly for a variety of reasons. Many times, it is just to keep the files from being backward compatible with their own or other products, forcing an upgrade path. Microsoft is rather famous for that, but they are by far not the only company to do so.
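To give a concrete (if simplistic) picture of the extraction side, here's a short Python sketch along the lines of the Unix strings utility, which pulls runs of printable characters out of an otherwise opaque file. It isn't tied to any particular format, so it will pick up both real embedded text and accidental byte runs:

    import re
    import sys

    def extract_strings(path, min_len=4):
        """Return runs of `min_len` or more printable ASCII characters from the file."""
        data = open(path, "rb").read()
        pattern = rb"[\x20-\x7e]{%d,}" % min_len
        return [m.group().decode("ascii") for m in re.finditer(pattern, data)]

    if __name__ == "__main__":
        for s in extract_strings(sys.argv[1]):
            print(s)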
I'm writing a pretty-printing combinator library, which means that I'm laying out tree-structured data in a (probably) monospaced font. To lay it out, I need to know how wide it's going to be. I'd like to do this without involving any particular rendering engine or font—chances are, it's going to be dumped to the terminal most of the time. Is there a correct way to do this?
From my reading, I’m thinking that grapheme clusters are one reasonable approximation. Is there a way to deal with characters that are typically fullwidth? What else do I need to worry about?
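In case it helps to be concrete, this is roughly the approximation I have in mind (Python, standard library only): combining marks count as width 0, East Asian fullwidth/wide characters as 2, everything else as 1. I realize a real grapheme-cluster segmenter and something like the wcwidth tables would be more accurate:

    import unicodedata

    def display_width(text):
        """Approximate number of monospaced cells `text` occupies."""
        width = 0
        for ch in text:
            if unicodedata.combining(ch):
                continue                                    # combining marks add no cell
            width += 2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
        return width

    print(display_width("hello"))    # 5
    print(display_width("e\u0301"))  # 1 -- base letter plus combining accent
    print(display_width("日本語"))    # 6 -- three wide characters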
I'm working on an RTL text document and I'd like less to display it right-to-left. The man page doesn't seem to mention anything regarding direction, only encoding.
P.S. I saw other less-related questions here (e.g. this), so I hope it's on topic.
less as such does not do this. While it can work with UTF-8 (see FAQ), RTL/LTR is a step further, and less portable. Actually, "BIDI" may yield more possibilities than "RTL", but you have to pick through them. A web search for
less+pager+bidi
finds something that seems promising: LESS-bidi - Direction agnostic stylesheets, but the name is misleading: despite "LESS", that project only deals with CSS for a browser. It has been dormant for nearly 3 years as well.
The Translate Shell page implies it has a workable viewer for BIDI text.
Ubuntu lists a package bidiv which might be useful.
I wonder if there is any known algorithm/strategy to add some noise to a text string (for instance, adding a random sequence of characters every now and then or something similar).
I don't want to completely destroy the text, just make it slightly unusable. Also, I'm not interested in reversing the changes; I can just recreate the original text from the sources I used to create it in the first place if needed.
Of course, a very basic algorithm for doing this could be easily implemented, but probably somebody has already created a somewhat more sophisticated algorithm for this. If a Java implementation of something like this is available, even better.
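For instance, something as basic as this Python sketch is what I mean by "a very basic algorithm" (I'd still prefer Java, and something smarter):

    import random
    import string

    def add_noise(text, rate=0.05, burst=3, alphabet=string.ascii_letters + string.digits):
        """After each character, with probability `rate`, insert `burst` random characters."""
        out = []
        for ch in text:
            out.append(ch)
            if random.random() < rate:
                out.append("".join(random.choice(alphabet) for _ in range(burst)))
        return "".join(out)

    print(add_noise("The quick brown fox jumps over the lazy dog."))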
If you are using .Net and you need some random bytes, maybe try the GetBytes method of RNGCryptoServiceProvider. Nice and random. You could also use it to help in selecting random positions to update.
For example if I know that ć should be ć, how can I find out the codepage transformation that occurred there?
It would be nice if there was an online site for this, but any tool will do the job. The final goal is to reverse the codepage transformation (with iconv or recode, but tools are not important, I'll take anything that works including python scripts)
EDIT:
Could you please be a little more verbose? Do you know for certain what some substring should be exactly? Or do you know just the language? Or are you just guessing? And the transformation that was applied, was it correct (i.e. the result is valid in the other charset)? Or was it a single transformation from charset X to Y while the text was actually in Z, so it's now wrong? Or was it a series of such transformations?
Actually, ideally I am looking for a tool that will tell me what happened (or what possibly happened) so I can try to transform it back to proper encoding.
What (I presume) happened in the problem I am trying to fix now is what is described in this answer: a utf-8 text file got opened as an ascii text file and then exported as csv.
It's extremely hard to do this in general. The main problem is that all the ascii-based encodings (iso-8859-*, dos and windows codepages) use the same range of byte values, so no particular byte or set of bytes will tell you which codepage the text is in.
There is one encoding that is easy to tell apart. If the text is valid UTF-8, then it's almost certainly not iso-8859-* nor any windows codepage, because while all byte values are valid in those, the chance of a valid utf-8 multi-byte sequence appearing in text written in them is almost zero.
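That check is a one-liner in most languages; for example, in Python:

    def looks_like_utf8(raw_bytes):
        """True if the bytes decode as valid UTF-8 -- a strong hint the text really is UTF-8."""
        try:
            raw_bytes.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    print(looks_like_utf8("naïve".encode("utf-8")))   # True
    print(looks_like_utf8("naïve".encode("cp1252")))  # False: a lone 0xEF byte isn't valid UTF-8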
Then it depends on which further encodings may be involved. A valid sequence in Shift-JIS or Big-5 is also unlikely to be valid in any other encoding, while telling apart similar encodings like cp1250 and iso-8859-2 requires spell-checking the words that contain the 3 or so characters that differ and seeing which way you get fewer errors.
If you can limit the number of transformations that may have happened, it shouldn't be too hard to put together a python script that tries them out, eliminates the obviously wrong ones, and uses a spell-checker to pick the most likely. I don't know of any tool that would do it.
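Something along these lines, for example (the candidate list and the scoring below are just placeholders; a real spell-checker such as hunspell or pyenchant would do much better, and the "Ä‡" input is only an illustration of utf-8 text mis-read as a western codepage):

    CANDIDATES = ["utf-8", "cp1250", "cp1252", "iso-8859-2", "iso-8859-1"]

    def untangle(mojibake):
        """Try re-encoding with one codepage and decoding with another; rank the survivors."""
        results = []
        for read_as in CANDIDATES:            # what the text was wrongly decoded as
            for really_was in CANDIDATES:     # what the bytes actually were
                if read_as == really_was:
                    continue
                try:
                    fixed = mojibake.encode(read_as).decode(really_was)
                except (UnicodeEncodeError, UnicodeDecodeError):
                    continue                  # not even a valid transformation -- eliminate
                # crude stand-in for spell-checking: prefer mostly letters/punctuation
                score = sum(c.isalpha() or c.isspace() or c in ".,;:!?-'\"" for c in fixed) / len(fixed)
                results.append((score, read_as, really_was, fixed))
        return sorted(results, reverse=True)

    for score, enc1, enc2, text in untangle("Ä‡")[:5]:
        print(f"{score:.2f}  read as {enc1}, really {enc2} -> {text!r}")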
Tools like that were quite popular a decade ago, but now it's quite rare to see damaged text.
As far as I know it can be done effectively, at least for a particular language. So, if you assume the text's language is Russian, you could collect some statistical information about characters or small groups of characters from a lot of sample texts. E.g. in English the "th" combination appears more often than "ht".
So, then you could try the different encoding combinations and choose the one whose output has the most probable text statistics.
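A toy version of that idea in Python (the reference text, the damaged bytes, and the candidate encodings here are all made up for illustration; real statistics would come from a large corpus in the expected language):

    from collections import Counter

    def bigram_frequencies(text):
        """Relative frequencies of character bigrams in `text`."""
        text = text.lower()
        counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
        total = sum(counts.values()) or 1
        return {bg: n / total for bg, n in counts.items()}

    # Tiny stand-in for language statistics gathered from a large sample corpus.
    REFERENCE = bigram_frequencies("the quick brown fox jumps over the lazy dog "
                                   "and then the other thing that they thought of")

    def plausibility(candidate):
        """Higher when the candidate's bigrams are common in the reference statistics."""
        return sum(REFERENCE.get(bg, 0.0) * f for bg, f in bigram_frequencies(candidate).items())

    raw = b"the\xe9 text"   # hypothetical damaged bytes
    decodings = {enc: raw.decode(enc, errors="replace") for enc in ("cp1252", "iso-8859-2", "utf-8")}
    best = max(decodings, key=lambda enc: plausibility(decodings[enc]))
    print(best, decodings[best])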