Unicode string columns in monospace - text

I'm writing a pretty-printing combinator library, which means that I'm laying out tree-structured data in a (probably) monospaced font. To lay it out, I need to know how wide it's going to be. I'd like to do this without involving any particular rendering engine or font—chances are, it's going to be dumped to the terminal most of the time. Is there a correct way to do this?
From my reading, I’m thinking that grapheme clusters are one reasonable approximation. Is there a way to deal with characters that are typically fullwidth? What else do I need to worry about?

Related

py3k default to byte instead of string

Just switching from Python2 to Python3 and the new string system is a real pain (or rather I'm not understanding its true benefit).
Is there any way to make it default to the old style bytes system without having to put a b before every string. I send a lot of commands via sockets and the code looks just ugly - i.e.
conn.sendall(b'k\n')
I tend to use this more than I worry about unicode
No, there isn't. And from what I gather you don't think it is a pain, and you do understand the benefit, you just think the b'' is ugly, which doesn't seem to be a very good reason to me.
Separating binary and text data is a great simplification in almost all cases. The need to prefix binary data with a b is a small price to pay for that.

How to determine codepage of a file (that had some codepage transformation applied to it)

For example if I know that ć should be ć, how can I find out the codepage transformation that occurred there?
It would be nice if there was an online site for this, but any tool will do the job. The final goal is to reverse the codepage transformation (with iconv or recode, but tools are not important, I'll take anything that works including python scripts)
EDIT:
Could you please be a little more verbose? You know for certain that some substring should be exactly. Or know just the language? Or just guessing? And the transformation that was applied, was it correct (i.e. it's valid in the other charset)? Or was it single transformation from charset X to Y but the text was actually in Z, so it's now wrong? Or was it a series of such transformations?
Actually, ideally I am looking for a tool that will tell me what happened (or what possibly happened) so I can try to transform it back to proper encoding.
What (I presume) happened in the problem I am trying to fix now is what is described in this answer - utf-8 text file got opened as ascii text file and then exported as csv.
It's extremely hard to do this generally. The main problem is that all the ascii-based encodings (iso-8859-*, dos and windows codepages) use the same range of codepoints, so no particular codepoint or set of codepoints will tell you what codepage the text is in.
There is one encoding that is easy to tell. If it's valid UTF-8, than it's almost certainly no iso-8859-* nor any windows codepage, because while all byte values are valid in them, the chance of valid utf-8 multi-byte sequence appearing in a text in them is almost zero.
Than it depends on which further encodings may can be involved. Valid sequence in Shift-JIS or Big-5 is also unlikely to be valid in any other encoding while telling apart similar encodings like cp1250 and iso-8859-2 requires spell-checking the words that contain the 3 or so characters that differ and seeing which way you get fewer errors.
If you can limit the number of transformation that may have happened, it shouldn't be too hard to put up a python script that will try them out, eliminate the obvious wrongs and uses a spell-checker to pick the most likely. I don't know about any tool that would do it.
The tools like that were quite popular decade ago. But now it's quite rare to see damaged text.
As I know it could be effectively done at least with a particular language. So, if you suggest the text language is Russian, you could collect some statistical information about characters or small groups of characters using a lot of sample texts. E.g. in English language the "th" combination appears more often than "ht".
So, then you could permute different encoding combinations and choose the one which has more probable text statistics.

Overlayed text?

Quick text-processing question. It's not necessarily related to programming, but this is the best place I figured I should go.
Rate down to tell me this kind of question is not welcome here. (Though, I really like my one little reputation point.)
Anyways, how can I encode text so that two characters get rendered in the same charspace?
NOTE: this is for plain-text -- nothing particularly complex.
The best you can do is put a backspace character between the two. However the outcome isn't likely to be useful to you, it will depend on what software is being used to display the text. The most likely is that the backspace will be ignored or shown as some generic "unavailable" glyph. The second most likely is that the second character will completely erase the first. You'd have to be very lucky for the two characters to be displayed one over the other in the same space.
If it's plain text to be processed by any editor, as far as I know you can't. Even if your text is encoded in Unicode, I don't think it provides combining characters for normal letters, but just for accents and similar symbols which are intended to be combined with other glyphs.
BTW, I'm not sure that stackoverflow is the right place for this kind of stuff, I'd see it better in superuser.com.

Advanced Text Rendering with Direct3D

Let me describe the "battlefield" of my task:
Multi-room audio/video chat with more than 1M users;
Custom Direct3D renderer;
What I need to implement is a TextOverVideo feature. The Text itself goes via network and is to be rendered on the recipient side with Direct3D renderer. AFAIK, it is commonly used in game development to create your own texture with letters/numbers and draw this items. Because our application must support many languages, we ought to use a standard. That's why I've been working with ID3DXFont interface but I've found out some unsatisfied limitations.
What I've faced is a lack of scalability. E.g. if user is resizing video window I have to RE-create D3DXFont with new D3DXFONT_DESC while he's doing that. I think it is unacceptable.
That is why the ONLY solution I see (due to my skills) is somehow render the text to a texture and therefore draw sprite with scaling, translation etc.
So, I'm not sure if I go into the correct direction. Please help with advice, experience, literature, sources...
Your question is a bit unclear. As I understand it, you want easily scalable font.
I think it is unacceptable
As far as I know, this is standard behavior for fonts - even for system fonts. They aren't supposed to be easily scalable.
Possible solutions:
Use ID3DXRenderTarget for rendering text onto texture. Font will be filtered when you scale it up too much. Some people will think that it looks ugly.
Write custom library that supports vector fonts. I.e. - it should be able to extract font outline from font, and build text from it. It will be MUCH slower than ID3DXFont (which is already slower than traditional "texture" fonts). Text will be easily scalable. Using this way, you are very likely to get visible artifacts ("noise") for small text. I wouldn't use that approach unless you want huge letters (40+ pixels). Freetype library may have functions for processing font outlines.
Or you could try using D3DXCreateText. This will create 3D text for ONE string. Won't be fast at all.
I'd forget about it. As long as user is happy about overall performance, improving font rendering routines (so their behavior looks nice to you) is not worth the effort.
--EDIT--
About ID3DXRenderTarget.
EVen if you use ID3DXRenderTarget, you'll need ID3DXFont. I.e. you use ID3DXFont to render text onto texture, and then use texture to blit text onto screen.
Because you said that performance is critical, you can delay creation of new ID3DXFont until user stops resizing video. I.e. When user starts resizing video, you use old font, but upscale it using texture. There will be filtering, of course. Once user stops resizing, you create new font when you have time. you probably can do that in separate thread, but I'm not sure about it. OR you could simply always render text in the same resolution as video. This way you won't have to worry about resizing it (it still will be filtered - along with the video). Some video players work this way.
Few more things about ID3DXFont. There is one problem with ID3DXFont - it is slow in situations where you need a lot of text (but you still need it, because it supports unicode, and writing texturefont with unicode support is pain). Last time I worked with it I optimized things by caching commonly used strings in the textures. I.e. any string that was drawn more than 3 frames in the row were rendered onto D3DFMT_A8R8G8B8 texture/render target, and then I've been copying that string from texture instead of using ID3DXFont. Strings that weren't rendered for a while, were removed from texture. That gave some serious boost. This solution, however is tricky - monitoring empty space in the texture, removing unused strings, and defragmenting the texture isn't exactly trivial (there is nothing exceptionally complicated, but it is easy to make a mistake). You won't need such complicated system unless your screen is literally covered by text.
ID3DXFont fonts are flat, always parallel to the screen. D3DXCreateText are meshes that can be scaled and rotated.
Texture fonts are fuzzy and don't look very clear. Not good for an app that uses lots of small text.
I am writing an app that can create 500 text meshes, each mesh averaging 3,000-5,000 vertices. The text meshes are created once, then are static. I get 700 fps on a GeForce 8800.

Antialiasing alternatives

I've seen antialiasing on Windows using GDI+, Java and also that provided by Photoshop and Gimp. Are there any other libraries out there which provide antialiasing facility without depending on support from the host OS?
Antigrain Geometry provides anti-aliased graphics in software.
As simon pointed out, the term anti-aliasing is misused/abused quite regularly so it's always helpful to know exactly what you're trying to do.
Since you mention GDI, I'll assume you're talking about maintaining nice crisp edges when you resize them - so something like a character in a font looks clean and not pixelated when you resize it 2x or 3x it's original size. For these sorts of things I've used a technique in the past called alpha-tested magnification - you can read the whitepaper here:
http://www.valvesoftware.com/publications/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf
When I implemented it, I used more than one plane so I could get better edges on all types of objects, but it covers that briefly towards the end. Of all the approaches (that I've used) to maintain quality when scaling vector images, this was the easiest and highest quality. This also has the advantage of being easily implemented in hardware. From an existing API standpoint, your best bet is to use either OpenGL or Direct3D - that being said, it really only requires bilinear filtered and texture mapped to accomplish what it does, so you could roll your own (I have in the past). If you are always dealing with rectangles and only need to do scaling it's pretty trivial, and adding rotation doesn't add that much complexity. If you do roll your own, make sure to pay particular attention to subpixel positioning (how you resolve pixel positions that do not fall on a full pixel, as this is critical to the quality and sometimes overlooked.
Hope that helps!
There are (often misnamed, btw, but that's a dead horse) many anti-aliasing approaches that can be used. Depending on what you know about the original signal and what the intended use is, different things are most likely to give you the desired result.
"Support from the host OS" is probably most sensible if the output is through the OS display facilities, since they have the most information about what is being done to the image.
I suppose that's a long way of asking what are you actually trying to do? Many graphics libraries will provide some form of antialiasing, whether or not they'll be appropriate depends a lot on what you're trying to achieve.

Resources