I'm struggling to deliver a project to a client. The job is to package files into an archive; simple, right? Well, the files have (and must have) French characters in their names. I'm archiving from the Linux command line; she's opening it from the desktop on Windows.
At first I tried zip, and it didn't work out. Character support appears to vary by implementation, from what I've read here on Stack Overflow. After unpacking, the resulting file names didn't look right to me (Ubuntu Archive Manager) or to her (WinZip on Windows).
We next tried tar. Finally things appear normal for me, but still not OK for the client (trying PeaZip and 7-Zip on Windows).
Going into this, I really didn't expect it to be a problem. French-speaking computer users must archive things; what are they using?
Any insight or assistance with this would be greatly appreciated. Thanks!
ZIP traditionally encodes filenames using the IBM437 encoding. To my knowledge, however, many tools (incorrectly) use the system's default encoding instead, which will likely cause problems in a situation like this, because the two ends might use different encodings.
In theory, ZIP also supports UTF-8 by now, which should resolve these problems, but again tool support is the problem. For example, as far as I know, the ZIP support in Windows Explorer can't handle UTF-8 encoded filenames.
So we end up with this: both ends have to agree on the encoding used for filenames, and you need an encoding that supports all the characters you have (any Unicode encoding will do; I'm not sure about IBM437). ZIP has come a long way, and thus there are many tools that disagree about encoding. If possible, explicitly specify the encoding to use, and prefer Unicode. For compatibility with arbitrary tools, you may be better off using a newer format that was designed with Unicode in mind.
7-Zip has supported it since 4.58 beta, according to the change log, but will only use it when the local code page doesn't support the required characters. The -mcu command-line switch forces UTF-8 for anything but ASCII. Local encodings usually differ only in the non-ASCII range, so this will most likely do the trick; that is, if the tool used for unpacking also supports UTF-8 (which is more likely for 7z than for ZIP, because 7z isn't as old and there are fewer unpacking tools).
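For example, on the Linux side with p7zip installed, something like the following should store non-ASCII names as UTF-8 in a ZIP (a sketch; the archive and folder names are placeholders, and the exact switch syntax may vary by 7-Zip version):

    # force UTF-8 filenames in the ZIP instead of the local code page
    7z a -tzip -mcu=on archive.zip dossier_français/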
WinRAR might also be worth a try.
Try using an archive program that allows you to specify the character encoding (say, UTF-8), or figuring out how to do it with the one you have. This forum thread might help you, because it's similar to what you're asking, albeit in reverse and for German rather than French: http://sourceforge.net/projects/sevenzip/forums/forum/45797/topic/3710172
Alternatively... You could nuke the accented characters. If francophones are on the receiving end of the file transfer, they may or may not be sympathetic (ask your users!).
French doesn't have all that many accents to worry about, really. You have [aeu]-grave, e-acute, [aeiou]-circumflex, [eiu]-diaeresis and c-cedilla to worry about, capital and lowercase (though that's more likely for the grave and acute ones, unless someone hit the Caps Lock key).
Tar has a --transform option. If you create a sed pattern that turns every ISO Latin-1 accented vowel and c into its unaccented version, you'll probably be okay.
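A sketch of that idea, assuming GNU tar and UTF-8 file names (extend the substitutions to cover every accented letter you actually use; the archive and folder names are placeholders):

    # each --transform is a sed-style substitution applied to member names
    tar -cf archive.tar \
        --transform='s/é/e/g' --transform='s/è/e/g' --transform='s/ê/e/g' \
        --transform='s/à/a/g' --transform='s/ç/c/g' \
        dossier_français/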
I think you should go with compression in the 7z format.
Under Linux it can be done using PeaZip, or by installing p7zip and using it through a UI like Ark or File Roller, depending on your desktop (I prefer PeaZip because it can be used on any desktop).
The 7z format was designed from the ground up with UTF-8 in mind (the author is Russian), and in my experience it has never failed.
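For instance, with p7zip installed, creating the archive is a one-liner (names are placeholders):

    7z a dossier.7z dossier_français/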
How can I read special characters from an external file? Here is a simple .txt file in French, whose content is the first paragraph of https://fr.lipsum.com/. As you can see in my screenshot, the file encoding is UTF-8 but the accents are not displayed correctly.
I tried various encodings in Notepad++ and in my Perl 6 script, like these:
enc => "utf8"
enc => "latin1"
With Python or Ruby scripts I don't encounter the problem. I can't find any precise example on this matter, probably because Perl 6 is still quite recent(?). Thank you.
My script as it is displayed in the screenshot:
my $text_contents = slurp "testfile.txt", enc => "utf8";
say $text_contents;
prompt;
Final edit: the solution is to enable an option, available in beta state in Windows 10 1803, that makes the OS handle Unicode characters properly: see answers and comments below...
If you're not using Windows
This SO answer is either entirely or almost entirely irrelevant to you.
If you're using Windows 10
Check the "Beta: Use Unicode UTF-8 for worldwide language support" option checkbox.
At least at the time I originally wrote this answer, the text near this Unicode-related checkbox claimed it's for programs that do not support Unicode, but you should just ignore that.[1]
At the time I originally wrote this answer, the checkbox was found under Control Panel, "Region" entry, "Administrative" tab, "Change system locale" button.
Microsoft may have changed this stuff since I wrote this answer, and may change it again, e.g. by moving and/or renaming the checkbox, or making things more involved than clicking a single checkbox.
Per their comment below this answer, the OP notes:
For those who are interested in that particular option, it can be found in the "legacy" Control Panel of Windows -> Region -> Administrative -> Edit settings...
If you're using an older version of Windows
Arguably, the good news is that Raku and Rakudo have some of the world's best modern support for Unicode, and the OK news is that it relies on Microsoft correctly supporting Unicode, which they're now trying to do.
The bad news is that they made a lot of mistakes in older versions of Windows (and even in Windows 10, which they're now trying to fix), so any solution will be constrained by those mistakes. (Perhaps the biggest problem is Microsoft's doublespeak on the topic[1], but let's hope we can work around that.)
That all said, please read the following and then either return to searching for solutions or post a fresh SO question and we'll try to help.
Quoting Wikipedia's page Unicode in Microsoft Windows:
they are still in 2018 improving their operating system support for UTF-8
Microsoft got off on the wrong foot with their Unicode support last century. The good news is that they have at last begun digging their way out of the hole they dug for themselves and everyone else.
But they're definitely not there yet -- not at the time of originally writing this answer, and, I suspect, not for another N years -- at least inasmuch as things don't work correctly out of the box for many end users. I think this is the root of most problems with Unicode on Windows.
Older languages like Python, Ruby and Perl came up with a range of hacks that hid the many problems with Microsoft's older UTF-8 support from most users in simple scenarios, by using what Microsoft ironically described as "Unicode support".
This has always come with the trade-off that things get very hairy or even completely unworkable for more complex applications in many locales around the world. (So much so that even the mighty Microsoft finally capitulated in 2018.)
In essence, until this new Microsoft effort to get with the program, software that ran on Windows has had no alternative but to either use their fundamentally broken "Unicode support" or to actually support Unicode properly.[1]
Raku and Rakudo focused on the latter, and problems with it when run on Windows are related to this conflicting with Microsoft's old broken approach. Fortunately, Microsoft is now getting with the program, and so we may be able to find a way to get around the problems you have with Unicode on Windows, provided you are patient.
In particular, if you are using an older Windows version, please expect it to not work at first with modern Unicode-aware software unless you are lucky. We'll still help if we can, but it'll likely involve you being patient with us and Microsoft and Rakudo, and vice versa.
Footnotes
[1] At the time I originally wrote this answer, there was text near the checkbox saying it's for programs that do not support Unicode. This is entirely the opposite of what's really going on, but hey, it's Microsoft.
Finally! We're starting to require that all our input files be encoded in UTF-8! This is something we've wanted to do for years. Unfortunately, we suck at it, since none of us have ever tried it, and most of us are either Windows programmers or are used to operating systems where UTF-8 is the only real option anyway; neither group knows anything about reading UTF-8 strings in a platform-agnostic way.
So we started looking at how to deal with UTF-8 in a platform-agnostic way and found that it's pretty confusing (because Windows), and the other questions I've found here on Stack Overflow don't really seem to cover our scenario, or else they are confusing. I found a reference to https://www.codeproject.com/Articles/38242/Reading-UTF-with-C-streams which I find a bit confusing and full of fluff.
So, a few assumptions (that must be true or we're in a state of GIGO):
All files are in UTF-8 (yay!)
The std::strings must contain UTF-8; no conversion allowed.
The solution must be locale-agnostic and work on macOS (10.13+), Windows (10+), Android, and iOS (10+).
Stream support is not required; we're dealing with local files only (for now), but support for streams is appreciated.
We're trying to avoid std::wstring if we can, and I see no reason to use it anyway. We're also trying to avoid any third-party libraries that don't use UTF-8 encoded std::string; a custom string with functions that overload and convert all std::string arguments to the custom string is acceptable.
Is there any way to do this using just the standard C++ library? Preferably by imbuing the global locale with a facet that tells the stream library to simply dump the content of files into strings (using custom delimiters as usual); no conversion allowed.
This question is only about reading UTF-8 files into std::strings and storing the content as UTF-8 encoded strings. Dealing with Windows APIs and such is a separate concern.
C++17 is available.
UTF-8 data is just a sequence of bytes that follows a specific encoding scheme. If you read a sequence of bytes that is legitimate UTF-8 data into a std::string, then the string contains UTF-8 data.
There's nothing special you have to actually do to make this happen. This works like any other C or C++ file loading. Just don't mess around with iostream locales and you'll be fine.
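A minimal sketch of that approach (read_utf8_file is a made-up name): open the file in binary mode so no newline translation happens, and dump the stream buffer straight into the string. The bytes land in the std::string exactly as stored on disk, so a UTF-8 file yields a UTF-8 string, with no locale involvement.

    #include <fstream>
    #include <sstream>
    #include <string>

    // Read a whole file into a std::string, byte for byte, no conversion.
    std::string read_utf8_file(const std::string& path)
    {
        std::ifstream in(path, std::ios::binary); // binary: no newline translation
        std::ostringstream buffer;
        buffer << in.rdbuf();                     // dump the entire stream
        return buffer.str();
    }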
On Windows, CHM is a very good option.
Is there anything other than delivering a static set of HTML pages and using a primitive call to a web browser (which is a problem in itself on Linux)? That approach would not offer any kind of full-text search or separate bookmarks, and it would open a new tab for each help call.
The GNOME Yelp program is what is used for GTK/GNOME applications. It supports a number of formats, but not CHM directly. They have started to define their own markup, named Mallard, but I don't know what the status of that is.
I'd still recommend static HTML as the best option (and of course man pages!). For example, you can use Sphinx to write beautiful documentation with full-text search support!
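For the record, getting started with Sphinx is just a few commands (a sketch; the directory names are placeholders):

    pip install sphinx
    sphinx-quickstart docs                       # scaffold a project interactively
    sphinx-build -b html docs docs/_build/html   # build HTML with built-in search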
There are CHM viewers available on Linux, though frankly, as a Linux user, I'd prefer to get static HTML pages.
Some examples are chmsee and kchmviewer.
AFAIK there is no universal system. Depending on your desktop environment (GNOME/KDE) there might be help systems, but they are usually based on loose files and use full-blown browsers (usually WebKit-based).
For Lazarus, a CHM-based help system and embedded browser were created, including CHM write support.
The reasons to avoid loose static HTML were mostly:
the 60,000-lemma static documentation took too long to install on lighter systems or systems with specialist filesystems.
CHM removes slack and adds compression.
we also support non-POSIX and OS X systems, and little filesystem-related problems (charsets/encoding, separators, path depth, etc.) and case-insensitive filesystems on *nix caused a lot of grief. The CHM-based help solved that, allowing one set of routines to access help data on all systems.
indexing and TOC are B-tree based, and can easily be merged at runtime from independently produced help sets. In general, integrating independently produced help files is an underappreciated aspect of help systems, and key to open platforms.
native full-text search.
Having your own viewer also makes it possible to take advantage of extra features on top of the base system.
I'm not mentioning the Lazarus system in the hope that you'll adopt it, since at the moment it is too much oriented toward a development system (SDK); the viewer is not even available as a separate package. I mainly mention it to illustrate the problems of loose HTML.
I haven't investigated what KDE/GNOME/Eclipse use as help systems in a while, though. If I had to restart from scratch, that's where I would look first.
If I had to create something myself quickly, I would use zipped static HTML, a single gzipped file with metadata/indexes, and the lightest browser (Konqueror?) I could find. Not ideal, not like Windows, but apparently the best Linux can offer.
I have a folder name in Japanese. CFileDialog's GetPathName() is returning some question marks when the folder is selected. Is there some way to solve this?
If your app is built with MBCS support rather than Unicode support, the Japanese path will be handled correctly only if your "Language for non-Unicode programs" (a.k.a. system locale) is set to Japanese, which is the case for your Japanese users but might not be the case for you if you are not Japanese.
If your system locale is not Japanese, the path is translated to your codepage before it is returned by GetPathName(). It will either contain replacement (?) chars or garbage. Most likely a mix of both.
Here are a few possibilities:
Don't do anything. Your app should work fine for most Japanese users. Or not...
Test your app under a Japanese codepage. To do so, either temporarily change your Language for non-Unicode programs (requires a reboot) or (much easier) test your app under AppLocale. (Note: Yes, it runs fine under Windows 7. This article may help if you have problems).
Switch to Unicode. Depending on the size of your codebase, this can be a very tedious task, mostly depending on inputs and outputs and whether you use _T("blah") string literals in your code. Of course, there are more aspects to it, but these are the most important. BTW, all new projects should be done with Unicode support in mind.
Handle this path problem specifically. Since we're speaking of a file dialog, the whole dialog should be opened as Unicode, which means you'll probably have to explicitly call the Unicode version of the underlying Win32 API rather than simply using CFileDialog (see the sketch after this list). It's not so complicated, but the risk is that you are only solving the first problem in a row: once you get the Japanese path correctly, you'll have to deal with Japanese text input by the user, and so on. So I don't think this solution is a good one.
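To illustrate option 4, here is a minimal sketch of calling the Unicode common dialog directly (PickFileW and the fixed-size buffer are illustrative choices, not a full replacement for CFileDialog):

    #include <windows.h>
    #include <commdlg.h>
    #include <string>

    // Open the Unicode file dialog; the path comes back as UTF-16,
    // independent of the "Language for non-Unicode programs" setting.
    bool PickFileW(std::wstring& path)
    {
        wchar_t buffer[MAX_PATH] = L"";
        OPENFILENAMEW ofn = {};
        ofn.lStructSize = sizeof(ofn);
        ofn.lpstrFile   = buffer;
        ofn.nMaxFile    = MAX_PATH;
        if (!GetOpenFileNameW(&ofn))
            return false;        // cancelled or an error occurred
        path = buffer;
        return true;
    }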
Solution #2 is certainly the quickest way to identify small issues. Solution #3 is for sure the best one in the long run. But make sure you actually need it, because it may be tedious for existing apps.
While most operating systems and web browsers have very good support for bidirectional text such as Hebrew and Arabic, most commercial and open-source software does not:
Most text editors, besides the original Notepad and the Visual Studio editor, do a very poor job (and I have tried dozens of them).
I could not find any file-compare tool that does a decent job, not even Beyond Compare.
Same thing for software and packages dealing with charting and reporting.
Some questions I have:
Do you share the same pain I do?
Is the software you write bidirectional compliant? Do you have bug reports about it?
Do you even know what the issues involved are? Do you test for them?
Any suggestions on how to make the software world a better place for bidirectional language speakers?
Do you share the same pain I do?
No. And that's probably the answer: most people have no idea how bidirectional languages work. I, for example, have some trouble working with that. Because I'm interested in this topic quite a bit, I was reading the Pango sources a while back, and that's probably the second reason why the support sucks: it's damn hard to get right.
I think the GNOME project has some of the best support for bidirectional user interfaces, thanks to Pango (of course I can't verify that, because I wouldn't be able to spot the problems).
But because you said "open source": I think the globalization support in open-source projects is generally outstanding. Linux sucks at pretty much everything, but internationalization is something they get right.
gettext is still one of the few translation systems that has a (half-baked, I know, but) working pluralization system.
Is the software you write bidirectional compliant? Do you have bug reports about it?
Probably not. I'm working on a web publishing software currently and that's one of the things I haven't tested at all so far :-(
Do you even know what the issues involved are? Do you test for them?
Bidirectional support is not on the direct roadmap, so there are no tests for it; I know where the issues are from the translation interface I wrote for Plurk.
Any suggestions on how to make the software world a better place for bidirectional language speakers?
For an open-source project: ask people who know where the issues are to help you. For closed source? Hire someone who knows.
I think there are two main answers to this:
1) Most languages read left-to-right, so people either think they can get away with not having it or just don't even think about it in the first place.
2) It can be hard to support it, depending on what your project is. If your tools/libraries don't support it, your software probably won't either. And it's not just hard in a programming sense, but hard to get it right when the programmers aren't familiar with right-to-left languages. As I understand it, to really properly support bi-directional text, some things in the UI must also be flipped to look "right."
The only reason I know anything about this is because I work with a guy who speaks Arabic as his native language and I've talked to him about it a little. I still don't know much about it. Our products only pretty recently started supporting Arabic and I haven't been a part of that effort.
Simple, get more bidirectional language speakers to voice their concerns! With so few bidirectional language users around, I'd imagine that bidirectional text support is pretty low on most people's priority lists. The more bug reports you and other bidirectional language speakers file, though, the more the problem will be addressed.
If you break a string into substrings and display them individually, you will break the OS bidi rendering; also, if you add some mostly innocent symbols (like a "-", for example), you will mess up the text display.
The two things you have to know to write bidi-compatible software are:
Always display entire strings, never try to display parts of a larger string.
Always test any formatting code with bidi text.
And if you are writing a text editor, word processor, or anything else that requires high-end typography and you can't follow rule 1 above, then writing a bidi rendering engine is a lot of work.
I'm left-handed, and deal with similar issues in the physical world. It's a natural part of being in the minority, that businesses primarily cater to the majority.
If you think there are problems with bidirectional text, you should check out the Turkish i problem sometime.
Anyhow, I think what will happen is either that text processing will become very standardized and the libraries will do things correctly, or you'll have to wait until the app becomes big enough to warrant adding good support.
I know RTL text in Flash is a pain in the ass; I've heard it's easier for web pages, although you've got to be careful how you process strings so they don't get mixed up.
This is an awfully subjective question, by the way, one that's impossible to find a 'solution' for - are you sure this is the right place to ask it?
I myself have been researching how to add native BiDi to Android. Results so far: lots of work; Android practically lacks real BiDi.
The issue is that the world of computers is all about the internet and sharing, especially open-source software. This means the dominant languages are the concern, and if you look, English is actually the standard, with other (mostly Western) languages provided as side translations.
I speak Arabic/Hebrew/English. With computers I use almost only English, with Arabic/Hebrew for local stuff (news, online TV, ...), which is handled well by web browsers. However, since I bought a Samsung Galaxy and started updating the firmware, I started noticing how big the problem is :(
A note regarding some of the answers: there are no "bidirectional languages". A language is either left-to-right or right-to-left (or top-to-bottom...). A text or a string can be bidirectional if it contains both, say, Hebrew and English.
Regarding the question, Firefox seems to work swell for me. Also MS Word, and that's pretty much everything I use Hebrew in.
Any suggestions on how to make the software world a better place for bidirectional language speakers?
Unfortunately, I don't think the situation will improve unless there are a lot more RTL-language-speakers participating in global affairs... which seems unlikely.
Currently we have Israel, which is a very technologically advanced society but very small, and nearly all the educated people speak English. And then there are the Arab countries and others that use Arabic script, which don't produce and consume nearly as much information as the Western world, according to studies I've seen.