Reading utf-8 files to std::string in C++

Finally! We're starting to require that all our input files are encoded in utf-8! This is something we've been wanting to do for years. Unfortunately, we suck at it: none of us has ever tried it, and most of us are either Windows programmers or are used to operating systems where utf-8 is the only real option anyway; neither group knows anything about reading utf-8 strings in a platform-agnostic way.
So we started looking at how to deal with utf-8 in a platform-agnostic way and found that it's pretty confusing (because Windows), and the other questions I've found here on Stack Overflow don't really seem to cover our scenario, or they are confusing themselves. I found a reference to https://www.codeproject.com/Articles/38242/Reading-UTF-with-C-streams which, I find, is a bit confusing and contains a great deal of fluff.
So, a few assumptions (that must be true or we're in a state of GIGO):
All files are in utf-8 (yay!)
The std::strings must contain utf-8; no conversion allowed.
The solution must be locale-agnostic and work on macOS (10.13+), Windows (10+), Android, and iOS (10+).
Stream support is not required; we're dealing with local files only (for now), but support for streams is appreciated.
We're trying to avoid using std::wstring if we can, and I see no reason to use it anyway. We're also trying to avoid any third-party libraries which do not use utf-8 encoded std::string; using a custom string type with functions that overload and convert all std::string arguments to the custom string is acceptable.
Is there any way to do this using just the standard C++ library? Preferably just by imbuing the global locale with a facet that tells the stream library to simply dump the content of files into strings (using custom delimiters as usual); no conversion allowed.
This question is only about reading utf-8 files into std::strings and storing the content as utf-8 encoded strings. Dealing with Windows APIs and such is a separate concern.
C++17 is available.

UTF-8 is just a sequence of bytes that follow a specific encoding. If you read a sequence of bytes that is legitimate UTF-8 data into a std::string, then the string contains UTF-8 data.
There's nothing special you have to actually do to make this happen. This works like any other C or C++ file loading. Just don't mess around with iostream locales and you'll be fine.
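For example, here is a minimal sketch of loading a whole file into a std::string (the function name and the error-handling choice are mine, not anything prescribed):

#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Read a file's raw bytes into a std::string with no conversion applied.
// Opening in binary mode also disables newline translation on Windows,
// so the utf-8 bytes arrive exactly as they are on disk.
std::string read_utf8_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    std::ostringstream buf;
    buf << in.rdbuf();   // dump the entire stream into the buffer
    return buf.str();    // the string now holds the file's utf-8 content
}

The same holds for std::getline and formatted extraction: as long as you don't imbue a converting facet, the bytes pass through untouched.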

Related

TI-Basic Problems

I recently bought a TI-84 Plus CE, and have been making programs using TI-BASIC.
I'm trying to make a simple text editor, and I need to convert character codes to characters. However, it seems that the char() command doesn't exist?
Please Help!
I don't believe that 84+ TI-BASIC supports ASCII in this way (though I know that 68k BASIC had the ord() command), but one thing you could do is store all the typeable glyphs into a string (see prgmGLYPHS on TI-Basic Developer, for example) and then use inString() and sub() to store/retrieve their values. It's not pretty, and it's not fast, but it works. Here's an example using only the uppercase letters:
:"ABCDEFGHIJKLMNOPQRSTUVWXYZ→Str1
:Input ">",Str2
:Disp inString(Str1,Str2
:Input ">",N
:Disp sub(Str1,N,1
Note: The following pertains to my experience with the TI-84+ SE. The 84+ CE runs on a newer processor than the Zilog Z80, so YMMV:
I expect what you're doing is storing your text in a list. Another thing that might be more efficient and secure is storing your text as an AppVar. These are allocated blocks of RAM/ROM that you can read from and write to at will... as long as you have a library to do it. With the 84+ SE you needed to use Celtic3 (or Doors CS, which includes Celtic as a library) to do that. I haven't used the 84+ CE enough to tell you what exists there, as the assembly code is entirely different. According to this reddit post, the best way to do that is to use the C toolchain, but I don't have experience with that either.

How to get system charset in Rust on Windows?

I'm working on a Rust connector for TDengine, and my problem is getting the system charset in Rust. Which crate or which method should I use for this?
To get the ANSI codepage (the one used for 8-bit text applications), use GetACP(). To get the OEM codepage (the one used in consoles), use GetOEMCP().
I don't know TDengine, but to keep your sanity you should avoid ANSI/OEM codepages and use UTF-8/Unicode whenever possible. The Rust OsString type makes this somewhat less painful.
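In Rust you would reach those two functions through a Win32 bindings crate (windows-sys and winapi both expose them). For reference, here is what the underlying calls look like in C++; the example values in the comments are just common cases:

#include <windows.h>
#include <cstdio>

int main() {
    // ANSI codepage used by the 8-bit ("A") APIs, e.g. 1252 on Western European systems
    std::printf("ANSI codepage: %u\n", GetACP());
    // OEM codepage used by the console, e.g. 437 or 850
    std::printf("OEM codepage:  %u\n", GetOEMCP());
}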

Is it possible to extract constants and other predefined values from binary executables?

Let's say we have this program here:
class Message {
    public static string SUPER_SECRET_STRING = "bar";
    public static void Main() {
        string SECRET = "foo";
        Console.Write(sha(SUPER_SECRET_STRING) + "" + sha(SECRET));
    }
}
Now, after building this program, is there any way using a hex editor or some other utility to extract the values "foo" and "bar" from the compiled binary file?
Also let's assume that memory editors are not allowed.
Is this applicable to all compiled languages like C++? What about ones that are run in another environment like Java or C#?
The answer from Mene is correct, but I wanted to put in my two cents to let you know how ridiculously easy it is to extract strings from compiled binaries (regardless of the language). If you have Linux, all you have to do is run the command strings <compiled binary> and you have the extracted strings. You don't have to be any sort of reverse engineer to pull this off. I just ran it against the eclipse binary on my Ubuntu machine and check out the (truncated) output:
> strings eclipse
ATSH
0[A\
8.uCH
The %s executable launcher was unable to locate its
companion shared library.
There was a problem loading the shared library and
finding the entry point.
setInitialArgs
-vmargs
-name
--launcher.library
--launcher.suppressErrors
--launcher.ini
eclipse
Notice how the string "The %s executable launcher was unable to locate its companion shared library. There was a problem loading the shared library and finding the entry point." appears in the output. This string is no doubt hard coded into the program.
When strings (and other data) are hard coded into a program, most compilers place them into a special section in the binary where they can be mapped directly into memory for access by the program as it needs them. If you were to open the binary with a hex editor, you could find this string easily.
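You can try this yourself with a toy program (the file name and the literal here are made up for illustration):

// toy.cpp -- the compiler places the literal in the binary's read-only data section
#include <cstdio>

static const char kSecret[] = "SUPER_SECRET_VALUE";

int main() {
    std::puts(kSecret);
}

Compile it, and the literal is one grep away:

> g++ toy.cpp -o toy
> strings toy | grep SECRET
SUPER_SECRET_VALUE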
Yes you could easily use a decompiler to extract those kinds of constants, especially strings (since they require a larger chunk of memory). This will even work in machine-code binaries and is even easier for VM-languages like Java and C#.
If you need to keep something secret in there, you will need to go to great lengths. Simply encrypting the string, for example, would add a layer of security, but for someone who knows what she's doing this won't be a big barrier: scanning the file for places with uncommon entropy is likely to reveal the key which was used for encryption. There are even systems which encode secrets by altering the low-level instructions used in the binary; those tools replace certain combinations of instructions with other, equivalent ones. But even those systems are not too hard to circumvent, as the uncommon combination of instructions will reveal the use of such tools.
And even if you manage to protect the string by some kind of encryption in your binary, you will at some point require a decrypted version during execution. Creating a memory dump at a point in time where the string is in use will thus also contain a copy of the secret value. This is especially problematic in Java, as you cannot explicitly deallocate a chunk of memory and strings are immutable (meaning that a "change" to a string will lead to a new chunk of memory).
As you see the problem is far from trivial. And of course there is no way to give you 100% security (think of all the cracked games and so on).
Something that can be implemented in a secure way is public-key cryptography. In that case you will need to keep the private key hidden, which might be possible if you could, for example, send things to your server to encrypt them, or if you have hardware which provides a Trusted Platform Module. But those things might not be feasible for your case.

CFileDialog GetPathName not reading Japanese

I have a folder name in Japanese. CFileDialog::GetPathName() is returning some question marks when the folder is selected. Is there some way to solve it?
If your app is built with MBCS support rather than Unicode support, the Japanese path will be handled correctly only if your "Language for non-Unicode programs" (aka system locale) is set to Japanese, which is the case for your Japanese users but might not be the case for you if you are not Japanese.
If your system locale is not Japanese, the path is translated to your codepage before it is returned by GetPathName(). It will either contain replacement (?) chars or garbage. Most likely a mix of both.
Here are a few possibilities:
Don't do anything. Your app should work fine for most Japanese users. Or not...
Test your app under a Japanese codepage. To do so, either temporarily change your Language for non-Unicode programs (requires a reboot) or (much easier) test your app under AppLocale. (Note: Yes, it runs fine under Windows 7. This article may help if you have problems).
Switch to Unicode. This can be a very tedious task depending on the size of your codebase, mostly around inputs and outputs and whether you use _T("blah") string literals in your code. Of course, there are more aspects to it, but those are the most important ones. BTW, all new projects should be done with Unicode support in mind.
Handle this path problem specifically. Since we're speaking of a file dialog, the whole dialog should be opened as Unicode, which means you'll probably have to explicitly call the Unicode version of the underlying Win32 API rather than simply CFileDialog (a minimal sketch follows below). It's not so complicated, but the risk is that you are only solving the first of a whole row of problems: after you get your Japanese path correctly, you'll have to deal with Japanese text input by the user, and so on. So I don't think this solution is a good one.
Solution #2 is certainly the quickest way to identify small issues. Solution #3 is for sure the best one in the long run. But make sure you actually need it, because it may be tedious for existing apps.
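For completeness, here is what the sketch mentioned in #4 could look like. It calls the wide common-dialog API directly; the function name and the flag choice are mine, and you need to link against comdlg32:

#include <windows.h>
#include <commdlg.h>
#include <string>

// Call the W (Unicode) version of the file-open dialog directly, so the
// returned path is UTF-16 regardless of the "Language for non-Unicode programs".
std::wstring PickFileW(HWND owner) {
    wchar_t buf[MAX_PATH] = L"";
    OPENFILENAMEW ofn = {};        // zero-initialize all fields
    ofn.lStructSize = sizeof(ofn);
    ofn.hwndOwner   = owner;
    ofn.lpstrFile   = buf;
    ofn.nMaxFile    = MAX_PATH;
    ofn.Flags       = OFN_FILEMUSTEXIST;
    if (GetOpenFileNameW(&ofn))
        return buf;                // full path, Japanese characters intact
    return L"";                    // cancelled or failed
}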

Modifying C++ DLL to support unicode - common pitfalls to avoid?

I have a Windows DLL that currently only supports ASCII, and I need to update it to work with Unicode strings. This DLL currently uses char* strings in a number of places, along with making a number of ANSI Windows API calls (like GetWindowTextA, RegQueryValueExA, CreateFileA, etc.).
I want to switch to using the Unicode/ANSI macros defined in VC++. So instead of char or CHAR I'd use TCHAR. For char* I'd use LPTSTR. And I think things like sprintf_s would be changed to _stprintf_s.
I've never really dealt with Unicode before, so I'm wondering if there are any common pitfalls I should look out for while doing this. Should it be as simple as replacing the types and method names with the proper macros, or are there other complications to look out for?
First read this article by Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then run through these links on Stack Overflow: What do I need to know about Unicode?
Generally, you are looking for any code that assumes one character = one byte (memory/buffer allocation, etc). But the links above will give you a pretty good rundown of the details.
The biggest danger is likely to be buffer sizes. If your memory allocations are made in terms of sizeof(TCHAR) you'll probably be OK, but if there is code where the original programmer was assuming that characters were 1 byte each and they used integers in malloc statements, that's hard to do a global search for.
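As an illustration of that pitfall, here is a hypothetical sketch (the function and variable names are mine, not from any real DLL):

#include <windows.h>
#include <tchar.h>
#include <stdlib.h>

// Fetch a window's title into a heap buffer. The commented-out line shows the
// classic one-char-equals-one-byte bug; the line below it sizes the buffer in
// TCHARs, so it is correct in both ANSI and Unicode builds.
void PrintWindowTitle(HWND hwnd) {
    const int len = GetWindowTextLength(hwnd) + 1;         // +1 for the terminator
    // TCHAR* buf = (TCHAR*)malloc(len);                   // WRONG under Unicode
    TCHAR* buf = (TCHAR*)malloc(len * sizeof(TCHAR));      // right in both builds
    if (buf && GetWindowText(hwnd, buf, len))
        _tprintf(_T("%s\n"), buf);
    free(buf);
}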
