Converting Scandinavian letters from wstring to string

Goal
Converting a wstring containing ÅåÄäÖöÆæØø into a string in C++.
Environment
C++17, Visual Studio Community 2017, Windows 10 Pro 64-bit
Description
I'm trying to convert a wstring to a string and have implemented the solution suggested at
https://stackoverflow.com/a/3999597/1997617
// This is the code I use:
// Convert a wide Unicode string to a UTF-8 string
std::string toString(const std::wstring &wstr)
{
    if (wstr.empty()) return std::string();
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}
So far so good.
My problem is that I have to handle Scandinavian letters (ÅåÄäÖöÆæØø) in addition to the English ones. Consider the input wstring below.
L"C:\\Users\\BjornLa\\Å-å-Ä-ä-Ö-ö Æ-æ-Ø-ø\\AEther Adept.jpg"
When returned it has become...
"C:\\Users\\BjornLa\\Å-å-Ä-ä-Ö-ö Æ-æ-Ø-ø\\AEther Adept.jpg"
... which causes me some trouble.
Question
So I would like to ask an often asked question, but with a small addition:
How do I convert a wstring to a string when it contains Scandinavian characters?

So, I did some additional reading and experimenting based on the comments I received.
Turns out the solution is quite simple. Just change CP_UTF8 to CP_ACP!
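For reference, here is what the changed function looks like, as a minimal sketch (the name toAnsiString is mine; only the code page argument differs from the function above):
// Convert a wide string to a string in the active ANSI code page (CP_ACP).
std::string toAnsiString(const std::wstring &wstr)
{
    if (wstr.empty()) return std::string();
    int size_needed = WideCharToMultiByte(CP_ACP, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    WideCharToMultiByte(CP_ACP, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}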
However...
Microsoft suggests that one should actually use CP_UTF8, if you read between the lines of the MSDN documentation for the function. The note for CP_ACP reads:
This value can be different on different computers, even on the same network. It can be changed on the same computer, leading to stored data becoming irrecoverably corrupted. This value is only intended for temporary use and permanent storage should use UTF-16 or UTF-8 if possible.
Also, the note for the entire function reads:
The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode. If using Unicode is not possible, applications should tag the data stream with the appropriate encoding name when protocols allow it. HTML and XML files allow tagging, but text files do not.
So even though this CP_ACP solution works fine for my test cases, it remains to be seen whether it is a good solution overall.

Related

How does a `Map<String, String>` know where the `String`s end in `MethodChannel` arguments?

If Dart and Kotlin code communicate through binary data (an array of 8-bit integers, 0-255), then how is the end of a String, or even of an int, represented in or determined from the binary sequence of bytes? Is there some special char code, or something else?
Also, is there a way to save a List<int> as-is to a file.txt, so it can be read back directly into a List<int> instead of going through serialization?
Please guide this new dev.
Thank you...
Since Flutter handles the MethodChannel on both the Dart side and the Kotlin side, it is free to use its own internal protocol to communicate between the native layer and Flutter. In theory it could use JSON, but it probably uses something else based on the supported types, which also makes it more efficient: https://docs.flutter.dev/development/platform-integration/platform-channels?tab=type-mappings-kotlin-tab#codec
For saving a List<int> to a file, you need to decide how you want to encode the content in the file and then how you want to decode it. It can be as simple as saving the numbers separated by commas, or encoding the list as JSON.
If your list of numbers can be represented with Uint8List or Int8List, then you can basically just save the numbers as raw bytes to the file and then read them again.
But List<int> is a list of 64-bit numbers, so you need to decide exactly how you want to encode it.
For writing to files, there are several ways to do it, but the right one depends on exactly what you want. Without more details, I can only suggest that you check the API: https://api.dart.dev/stable/2.17.3/dart-io/File-class.html

VC6 \r\n and Write works; Visual Studio 2013 does not

The following code:
if(!cfile.Open(fileName, CFile::modeCreate | CFile::modeReadWrite)){
    return;
}
ggg.Format(_T("0 \r\n"));
cfile.Write(ggg, ggg.GetLength());
ggg.Format(_T("SECTION \r\n"));
cfile.Write(ggg, ggg.GetLength());
produces the following:
0 SECTI
Clearly this is wrong: (a) the \r\n is ignored, and (b) the word SECTION is cut off.
Can someone please tell me what I am doing wrong?
The same code without _T() in VC6 produces the correct results.
Thank you
a.
Apparently, you are building a Unicode build; CString (presumably that's what ggg is) holds a sequence of wchar_t characters, each two bytes large. ggg.GetLength() is the length of the string in characters.
However, CFile::Write takes the length in bytes, not in characters. You are passing half the number of bytes actually taken by the string, so only half the number of characters gets written.
Have you considered changing lines like:
cfile.Write(ggg, ggg.GetLength());
to:
cfile.Write(ggg, ggg.GetLength() * sizeof(TCHAR));
Write needs the number of bytes (not characters). Since each Unicode (UTF-16) character is 2 bytes wide, you need to account for that. sizeof(TCHAR) is the number of bytes each character takes on a given platform: 1 in an ANSI build, 2 in a Unicode build. Multiply that by the string length and the number of bytes will be correct.
Information on TCHAR can be found in the MSDN documentation. In particular, it is defined as:
The _TCHAR data type is defined conditionally in Tchar.h. If the symbol _UNICODE is defined for your build, _TCHAR is defined as wchar_t; otherwise, for single-byte and MBCS builds, it is defined as char. (wchar_t, the basic Unicode wide-character data type, is the 16-bit counterpart to an 8-bit signed char.)
TCHAR and _TCHAR in your usage should be synonymous. However I believe these days Microsoft recommends including <tchar.h> and using _TCHAR. What I can't tell you is if _TCHAR existed on VC 6.0.
If you use the method above and you build for Unicode, your output files will be in Unicode; if you build for ANSI, the output will be 8-bit ASCII.
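Applied to the original snippet, the corrected calls would look something like this (a sketch only, assuming ggg is a CString as above):
ggg.Format(_T("0 \r\n"));
cfile.Write(ggg, ggg.GetLength() * sizeof(TCHAR));   // byte count = character count * bytes per TCHAR
ggg.Format(_T("SECTION \r\n"));
cfile.Write(ggg, ggg.GetLength() * sizeof(TCHAR));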
Want CFile::Write to output ASCII no matter what? Read on...
If you want all text written to the file as 8-bit ASCII, you are going to have to use one of the conversion macros, in particular CT2A. More on the macros can be found in the MSDN documentation. Each macro's name describes the conversion: CT2A converts the generic character string (equivalent to W when _UNICODE is defined, equivalent to A otherwise) to ASCII. So no matter whether the build uses Unicode or ASCII, it would output ASCII. Your code would look something like:
ggg.Format(_T("0 \r\n"));
cfile.Write(CT2A(ggg), ggg.GetLength());
ggg.Format(_T("SECTION \r\n"));
cfile.Write(CT2A(ggg), ggg.GetLength());
Since the macro converts everything to ASCII, CString's GetLength() will suffice as the byte count.
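If you prefer to be explicit about the byte count, the same idea can be written with a named temporary. A minimal sketch, assuming a Unicode build and the same ggg and cfile as above; strlen on the converted buffer also stays correct if the conversion ever produces multi-byte characters:
CT2A ansi(ggg);                         // 8-bit copy of the CString, valid for the lifetime of 'ansi'
cfile.Write(ansi, (UINT)strlen(ansi));  // strlen gives the byte count of the converted text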

String encoding error from Delphi 7 to XE

I found some source code on the internet for encrypting strings, and I saw that on Delphi 7 the string is encrypted and decrypted correctly, but when I try to do the same thing with Delphi XE2, XE3, XE4 or XE5, the encryption and decryption fail with this error: "invalid buffer size for decryption".
I'm using aes.pas and eiaes.pas from here: http://code.google.com/p/jalphi-lib/source/browse/trunk/codemyth/delphi/dontuseit/?r=7
I think the problem is the encoding of the string.
Is it possible to solve this problem?
The AES library you provided the link to has not been updated to support the fact that "String" in later versions of Delphi (Delphi 2009 onward) is now a UnicodeString where each character is a WideChar.
You have 4 options:
1. Contact the library author and ask if a Unicode version is planned/available.
2. Attempt to modify the library to support Unicode yourself (or find someone who can/will help do this).
3. Find an alternative encryption library that already supports Unicode.
4. Ensure that you only use ANSI strings with the library.
This last option may not be viable for you, but if it is, you will still need to modify the AES library; however, you will not need to change any of its logic.
The problem is that "String" and "Char" in the later versions of Delphi are "Wide" types (2 bytes per 'character'). This difference is almost certainly causing problems with code in the AES library that assumes only ONE byte per character.
You can make these assumptions valid by making sure that the AES code is working with ANSI Strings.
If you choose to do this, then I would suggest that you introduce some new types:
type
  AESString = ANSIString;
  AESChar = ANSIChar;
  PAESChar = ^AESChar;
You will then need to go through the AES library code replacing any reference to "String" with "AESString", "Char" with "AESChar" and "PChar" with "PAESChar".
This should then make AES an ANSI String library and it will still be usable in Delphi 7 (i.e. pre-Delphi 2009) if that's important for you.
If you then find in the future that you do need to fully support Unicode strings and properly fix the AES library code itself, you can do so and then simply change the AESString and AESChar types to:
type
  AESString = String;
  AESChar = Char;
If this is then compiled with a non-Unicode version of Delphi, the library will automatically revert back to ANSI String ("String" == ANSIString pre-D2009), so your Unicode changes will need to take this into account if you need to support both Unicode and non-Unicode versions of Delphi. You do need to be careful, but this is not difficult.

Convert char array to UNICODE in MFC C++

I'm using the following code to read files from a folder in Windows. However, since this is an MFC application, I have to convert the char array to UNICODE. For example, if I hard-code the path as "C:\images3\test\" as shown below, the code works.
WIN32_FIND_DATA FindFileData;
HANDLE hFind = INVALID_HANDLE_VALUE;
hFind = FindFirstFile(_T("C:\\images3\\test\\"), &FindFileData);
What I want is to get this working as follows:
char* pathOfFileType;
hFind = FindFirstFile(_T(pathOfFileType), &FindFileData);
Can anyone tell me how to fix this problem?
Thanks
Thanks a lot for all your responses. I learnt a lot from those answers, because I didn't have much of an idea about what was happening underneath. Meanwhile, I managed to get rid of the issue by simply converting to UNICODE using the following code, with minimal changes to my existing code.
#include <atlconv.h>
USES_CONVERSION;
//An ANSI string
LPSTR lpsz_ANSI_String = pathOfFileType;
//ANSI string being converted to a UNICODE string
LPWSTR lpUnicodeStr = A2W( lpsz_ANSI_String );
hFind = FindFirstFile(lpUnicodeStr, &FindFileData);
You can use the MultiByteToWideChar function to convert a string from chars to UTF-16, but you'd be better off getting pathOfFileType directly in Unicode from the user or from wherever you take it; otherwise you may still experience problems with paths that contain characters not included in the current code page.
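A minimal sketch of that conversion, assuming pathOfFileType is a NUL-terminated string in the current ANSI code page (requires <vector> and <windows.h>; error handling omitted):
// Ask for the required length first (in wide characters, including the terminating NUL).
int len = MultiByteToWideChar(CP_ACP, 0, pathOfFileType, -1, NULL, 0);
std::vector<wchar_t> widePath(len);
MultiByteToWideChar(CP_ACP, 0, pathOfFileType, -1, widePath.data(), len);
hFind = FindFirstFileW(widePath.data(), &FindFileData);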
Your question demonstrates a confusion of several issues. First, using MFC doesn't mean you have to convert the character array to Unicode; one has nothing to do with the other. Furthermore, FindFirstFile is a Win32 API, not an MFC function. Finally, _T("abc") is not necessarily Unicode; rather, _T(X) is a macro that in multi-byte builds expands to X, and in Unicode builds expands to L X, creating a wide-character literal. This is designed so that your code can compile in a Unicode or multi-byte configuration. To achieve the same flexibility when declaring a variable, you use the TCHAR type instead of char or wchar_t. So your second snippet should look like:
TCHAR* pathOfFileType;
hFind = FindFirstFile(pathOfFileType, &FindFileData);
Note there is no _T macro; it is only applied to string literals, not identifiers.
"since this a MFC application I have to convert the char array to UNICODE"
Not so. If you wish, you can change the project to use the Multi-Byte Character Set.
In Project Properties > General, change Character Set to 'Use Multi-Byte Character Set'.
Now this will work:
char* pathOfFileType;
hFind = FindFirstFile(pathOfFileType, &FindFileData);
Supposing you want to use UNICODE (Visual Studio's name for the 2-byte encoding of Unicode characters native to Windows), then you have to explicitly call the ANSI (MBCS) version of the API:
char* pathOfFileType;
hFind = FindFirstFileA(pathOfFileType, &FindFileData);

How do I PInvoke a multi-byte ANSI string?

I'm working on a PInvoke wrapper for a library that does not support Unicode strings, but does support multi-byte ANSI strings. While investigating FxCop reports on the library, I noticed that the string marshaling being used had some interesting side effects. The PInvoke method was using "best fit" mapping to create a single-byte ANSI string. For illustration, this is what one method looked like:
[DllImport("thedll.dll", CharSet=CharSet.Ansi)]
public static extern int CreateNewResource(string resourceName);
The result of calling this function with a string that contains non-ASCII characters is that Windows finds a "close" character; generally this looks like it ends up being "???". If we pretend that 'a' is a non-ASCII character, then passing "cat" as a parameter would create a resource named "c?t".
If I follow the guidelines in the FxCop rule, I end up with something like this:
[DllImport("thedll.dll", CharSet=CharSet.Ansi, BestFitMapping = false, ThrowOnUnmappableChar = true)]
public static extern int CreateNewResource([MarshalAs(UnmanagedType.LPStr)] string resourceName);
This introduces a change in behavior; now when a character cannot be mapped an exception is thrown. This concerns me because this is a breaking change, so I'd like to try and marshal the strings as multi-byte ANSI but I cannot see a way to do so. UnmanagedType.LPStr is specified to be a single-byte ANSI string, LPTStr will be Unicode or ANSI depending on the system, and LPWStr is not what the library expects.
How would I tell PInvoke to marshal the string as a multibyte string? I see there's a WideCharToMultiByte() API function, could I change the signature to expect an IntPtr to a string I create in unmanaged memory? It seems like this still has many of the problems that the current implementation has (it still might have to drop or substitute characters), so I'm not sure if this is an improvement. Is there another method of marshaling that I'm missing?
ANSI is multi-byte, and ANSI strings are encoded according to the codepage currently enabled on the system. WideCharToMultiByte works the same way as P/Invoke.
Maybe what you're after is conversion to UTF-8. Although WideCharToMultiByte supports this, I don't think P/Invoke does, since it's not possible to adopt UTF-8 as the system-wide ANSI code page. At this point you'd be looking at passing the string as an IntPtr instead, although if you're doing that, you may as well use the managed Encoding class to do the conversion, rather than WideCharToMultiByte.
Here is the best way I've found to accomplish this. Instead of marshalling as a string, marshal as a byte[], and put the responsibility on the caller of the P/Invoke API to convert to a byte array in the most appropriate fashion, most likely by using one of the Text.Encoding classes.
If you end up having to call WideCharToMultiByte manually, I would get rid of the p/invoke and manually marshal this using WideCharToMultiByte in a C++/CLI wrapper function. Managed C++ is much better at these interop scenarios than C# is.
Though, if this is the only p/invoke you have, it's probably not worth it.
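For what it's worth, a minimal sketch of what such a C++/CLI wrapper might look like; the native export declaration, header names, and code page are assumptions, and error handling is omitted:
// Compiled with /clr; PtrToStringChars comes from <vcclr.h>.
#include <windows.h>
#include <vector>
#include <vcclr.h>

// Assumed native export from thedll.dll (not verified against the real library).
extern "C" int CreateNewResource(const char* resourceName);

int CreateNewResourceWrapper(System::String^ resourceName)
{
    // Pin the managed UTF-16 string so its characters can be passed to the Win32 API.
    pin_ptr<const wchar_t> wide = PtrToStringChars(resourceName);

    // Narrow to the active ANSI code page; -1 means "include the terminating NUL".
    int bytes = WideCharToMultiByte(CP_ACP, 0, wide, -1, nullptr, 0, nullptr, nullptr);
    std::vector<char> narrow(bytes);
    WideCharToMultiByte(CP_ACP, 0, wide, -1, narrow.data(), bytes, nullptr, nullptr);

    return CreateNewResource(narrow.data());
}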
