How do I PInvoke a multi-byte ANSI string?

I'm working on a PInvoke wrapper for a library that does not support Unicode strings, but does support multi-byte ANSI strings. While investigating FxCop reports on the library, I noticed that the string marshaling being used had some interesting side effects. The PInvoke method was using "best fit" mapping to create a single-byte ANSI string. For illustration, this is what one method looked like:
[DllImport("thedll.dll", CharSet=CharSet.Ansi)]
public static extern int CreateNewResource(string resourceName);
The result of calling this function with a string that contains non-ASCII characters is that Windows finds a "close" character, which generally ends up being "?". If we pretend that 'a' is a non-ASCII character, then passing "cat" as a parameter would create a resource named "c?t".
If I follow the guidelines in the FxCop rule, I end up with something like this:
[DllImport("thedll.dll", CharSet=CharSet.Ansi, BestFitMapping = false, ThrowOnUnmappableChar = true)]
public static extern int CreateNewResource([MarshalAs(UnmanagedType.LPStr)] string resourceName);
This introduces a change in behavior; now when a character cannot be mapped an exception is thrown. This concerns me because this is a breaking change, so I'd like to try and marshal the strings as multi-byte ANSI but I cannot see a way to do so. UnmanagedType.LPStr is specified to be a single-byte ANSI string, LPTStr will be Unicode or ANSI depending on the system, and LPWStr is not what the library expects.
How would I tell P/Invoke to marshal the string as a multi-byte string? I see there's a WideCharToMultiByte() API function; could I change the signature to expect an IntPtr to a string I create in unmanaged memory? It seems like this still has many of the problems that the current implementation has (it might still have to drop or substitute characters), so I'm not sure it's an improvement. Is there another method of marshaling that I'm missing?

ANSI is already multi-byte: ANSI strings are encoded according to the code page currently enabled on the system, and on a double-byte code page a single character can occupy two bytes. WideCharToMultiByte works the same way P/Invoke's ANSI marshaling does: it converts through a code page, with the same substitution behavior.
Maybe what you're after is conversion to UTF-8. Although WideCharToMultiByte supports this, I don't think P/Invoke does, since it's not possible to adopt UTF-8 as the system-wide ANSI code page. At this point you'd be looking at passing the string as an IntPtr instead, although if you're doing that, you may as well use the managed Encoding class to do the conversion, rather than WideCharToMultiByte.
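As a rough sketch of that IntPtr approach (assuming the CreateNewResource export from the question; the class name and code page 932 are purely illustrative choices):
using System;
using System.Runtime.InteropServices;
using System.Text;

static class MultiByteInterop
{
    // The native signature takes a raw pointer, so the P/Invoke marshaler
    // performs no string conversion of its own.
    [DllImport("thedll.dll")]
    public static extern int CreateNewResource(IntPtr resourceName);

    public static int CreateNewResourceMultiByte(string resourceName)
    {
        // Convert with the managed Encoding class instead of WideCharToMultiByte.
        byte[] bytes = Encoding.GetEncoding(932).GetBytes(resourceName + '\0'); // NUL-terminated for the native side
        IntPtr buffer = Marshal.AllocHGlobal(bytes.Length);
        try
        {
            Marshal.Copy(bytes, 0, buffer, bytes.Length);
            return CreateNewResource(buffer);
        }
        finally
        {
            Marshal.FreeHGlobal(buffer);
        }
    }
}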

Here is the best way I've found to accomplish this: instead of marshaling as a string, marshal as a byte[]. Put the responsibility on the caller of the P/Invoke function to convert the string to a byte array in the most appropriate fashion, most likely by using one of the System.Text.Encoding classes.
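A minimal sketch of what this looks like, again assuming the CreateNewResource export from the question (the wrapper name CreateNewResourceAnsi and the Windows-1250 code page are illustrative, not part of the library):
using System.Runtime.InteropServices;
using System.Text;

static class NativeMethods
{
    // A byte[] is passed through to the native code as-is, so no best-fit
    // mapping can occur behind the caller's back.
    [DllImport("thedll.dll")]
    public static extern int CreateNewResource(byte[] resourceName);

    public static int CreateNewResourceAnsi(string resourceName)
    {
        // The caller picks the encoding and the fallback behavior explicitly;
        // ExceptionFallback makes unmappable characters throw instead of
        // being silently substituted.
        Encoding encoding = Encoding.GetEncoding(1250,
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        byte[] bytes = encoding.GetBytes(resourceName + '\0'); // keep the NUL terminator
        return CreateNewResource(bytes);
    }
}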

If you end up having to call WideCharToMultiByte manually, I would get rid of the P/Invoke and marshal the string manually using WideCharToMultiByte in a C++/CLI wrapper function. Managed C++ is much better at these interop scenarios than C# is.
Though, if this is the only p/invoke you have, it's probably not worth it.

Related

Types for strings escaped/encoded differently

Recently I've been dealing with escaping/encoding issues. I have a bunch of APIs that receive and return Strings escaped/encoded in different ways. In order to clean up the mess I'd like to introduce new types XmlEscapedString, HtmlEscapedString, UrlEncodedString, etc. and use them instead of Strings.
The problem is that the compiler cannot check the encoding/escaping, so I'll get runtime errors.
I can also provide "conversion" functions that escape/encode input as necessary. Does this make sense?
The compiler can enforce that you pass the types through your encoding/decoding functions; this should be enough, provided you get things right at the boundaries (if you have a correctly encoded XmlEscapedString and convert it to a UrlEncodedString, the result is always going to be correctly encoded, no?). You could use constructors or conversion methods that check the escaping initially, though you might pay a performance penalty for doing so.
(Theoretically it might be possible to check a string's escaping at compile time using type-level programming, but this would be exceedingly difficult and only work on literals anyway, when it sounds like the problem is Strings coming in from other APIs).
My own compromise position would probably be to use tagged types (using Scalaz tags) and have the conversion from untagged String to tagged string perform the checking, i.e.:
import scalaz._, Scalaz._

sealed trait XmlEscaped

def xmlEscape(rawString: String): String @@ XmlEscaped = {
  val escapedString: String = ??? // perform escaping; guaranteed to return a correctly-escaped String
  Tag[String, XmlEscaped](escapedString)
}

def castToXmlEscaped(escapedStringFromJavaApi: String): String @@ XmlEscaped = {
  require(...) // confirm that the string is properly escaped
  Tag[String, XmlEscaped](escapedStringFromJavaApi)
}

def someMethodThatRequiresAnEscapedString(string: String @@ XmlEscaped): Unit = ???
Then we use castToXmlEscaped for Strings that are already supposed to be XML-escaped, so we check there, but we only have to check once; the rest of the time we pass it around as a String @@ XmlEscaped, and the compiler will enforce that we never pass a non-escaped string to a method that expects one.

ServiceStack does not escape control characters in JSON

ServiceStack's JsonSerializer does not seem to encode control characters correctly.
For example, this C# expression....
JsonSerializer.SerializeToString(new { Text = "\u0010" })
... evaluates to this...
{"Text":"?"}
... where the "?" is the literal control character.
Instead, according to http://www.json.org it should evaluate to this:
{"Text":"\u0010"}
Is this a known bug or am I missing something?
The bad JSON output by my services is causing errors during deserialization by my service consumers.
You need to tell the serializer to escape unicode characters.
JsConfig.EscapeUnicode = true;
JsonSerializer.SerializeToString(new { Text = "\u0010" });
The above evaluates to this:
{"Text":"\u0010"}
Thanks Mike, that works. But I think this approach escapes ALL non-ASCII Unicode characters in addition to control characters.
I'm expecting to have a lot of foreign language characters in my data (Arabic, for example) so this will cause significant size bloat versus just including those unescaped unicode characters in the JSON (which is still standard-compliant).
I imagine the purpose of EscapeUnicode = true is to produce JSON that can be stored or transmitted with simple ASCII encoding, which is certainly useful. And it apparently also encodes ASCII control characters as a side effect, which does solve my problem.
But in my opinion, JsonSerializer should escape control characters regardless of the EscapeUnicode setting since the standard requires it. I consider this a bug.
Since this is primarily a problem for me within my Service Stack services I also found this solution:
SetConfig(new EndpointHostConfig
{
    UseBclJsonSerializers = true
});
This tells ServiceStack to use .NET's built-in DataContractJsonSerializer instead of ServiceStack's JsonSerializer. I have verified that DataContractJsonSerializer does escape control characters correctly.
So it appears that I need to choose between JsonSerializer with EscapeUnicode = true (faster but with bloated output) and DataContractJsonSerializer (slower but with compact Unicode output).
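A third option, sketched here as a hypothetical workaround rather than anything ServiceStack provides: keep the default serializer settings and post-process the JSON, escaping only the ASCII control characters. The regex runs over the whole JSON text, which is safe for ServiceStack's compact output because a raw control character can only have come from a string value:
using System.Text.RegularExpressions;

static class JsonControlCharFix
{
    // Hypothetical helper: replaces each raw ASCII control character
    // (U+0000..U+001F) with its \uXXXX escape sequence. Already-escaped
    // sequences contain no raw control characters, so they are untouched.
    public static string EscapeControlChars(string json)
    {
        return Regex.Replace(json, @"[\x00-\x1f]",
            m => "\\u" + ((int)m.Value[0]).ToString("x4"));
    }
}
Usage would be EscapeControlChars(JsonSerializer.SerializeToString(new { Text = "\u0010" })), which yields {"Text":"\u0010"}.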

Delphi: Upgrade from 6 to XE2 - TStringList

We have to upgrade to XE2 (from Delphi 6).
I've collected a lot of information about this, but one thing still isn't clear to me.
We are using String, which is what XE calls AnsiString.
As I understand it, we must change everything to (P)Ansi[String/Char] in our libraries to avoid the side effects of the Unicode conversion and to get our projects to compile.
That's fine, but we also use TStringList, and I haven't found any TAnsiStringList class to switch to... ;-)
What do you know about this? Can this cause problems too? Or does this class have an option to preserve the strings?
(OK, that looks like three questions, but it's really just one.)
The program/OS language is Hungarian; the charset is WIN-1250, which has some unusual characters like Ő and Ű...
Thanks for any information, links, etc.
1) First of all: WHY would you use an Ansi string list rather than converting your whole project to the Unicode-aware TStringList? There should be specific, detailed reasons before suggesting viable alternatives.
Unicode is a superset of Windows-1250, Windows-1251 and the like.
Normally all your locale-specific strings would simply be converted losslessly to Unicode. It is the opposite conversion, Unicode to AnsiString, that may lose data,
whether explicitly or implicitly (like the AnsiChar reduction in "if char-var in char-set").
You may have type-unsafe APIs, such as DLLs, where the compiler cannot check whether you pass PChar or PAnsiChar; but you should not pass objects like TStrings into DLLs anyway (there are BPLs for that).
So you probably just do not need a TAnsiStringList.
2) You can take TJclAnsiStringList from the JEDI Code Library.
3) You can use the stock TList<AnsiString> type in XE2.

Delphi Embarcadero XE: tons of warnings with String and PAnsiChar

I'm trying to migrate from Delphi 2007 to Embarcadero RAD Studio XE.
I'm getting tons of warnings.
They all look like this: I have a procedure where I declare a "String":
procedure SendMail( ADestinataire,ASubject : String);
And I'm trying to call Windows API like:
Res := MAPIResolveName(Session, Application.Handle,
PAnsiChar(ADestinataire), MAPI_LOGON_UI, 0, PRecip);
So the warnings are:
W1044 Suspicious typecast of string to PAnsiChar
What am I doing wrong / how should I correct this (350 warnings...)?
Thank you very much
MAPIResolveName takes an LPSTR parameter, which is PAnsiChar in Delphi. Simple MAPI does not support UTF-16 strings (though it can be used with UTF-8 strings), so if you stick with Simple MAPI you should use AnsiStrings, for example
procedure SendMail( ADestinataire,ASubject : AnsiString);
or better you can use
procedure SendMail( ADestinataire,ASubject : String);
and explicitly convert string parameters to AnsiStrings prior to calling MAPIResolveName
Update
Simple MAPI as a whole is now deprecated; it can be used with UTF-8 strings, but that requires some changes in code and in the registry.
So if the question is about quickly porting old ANSI Simple MAPI code to Unicode Delphi, the best approach is to stick with AnsiStrings.
A more solid approach is to abandon Simple MAPI completely and use Extended MAPI instead.
Just write
Res := MAPIResolveName(Session, Application.Handle,
PChar(ADestinataire), MAPI_LOGON_UI, 0, PRecip);
If you have a string, that is, a Unicode string, that is, a pointer to a sequence of Unicode characters, you shouldn't cast it to PAnsiChar. PAnsiChar is a pointer to a sequence of non-Unicode characters. A cast to a PSomethingChar type simply tells the compiler to interpret the thing inside the cast as a pointer of the specified type; it doesn't do any conversion. So, basically, right now you are lying to the compiler: you have a Unicode string and are instructing the compiler to interpret it as an ANSI (non-Unicode) string. That's bad.
Instead, you should cast it to PWideChar, a pointer to a sequence of Unicode characters. In Delphi 2009+, PChar is equivalent to PWideChar.
Of course, if you send a pointer to a sequence of Unicode characters to the function, then the function had better expect Unicode characters, but I am not sure that this is the case for the MAPIResolveName function. I suspect that it actually requires ANSI (that is, non-Unicode) characters. If this is the case, you need to convert the Unicode string to an ANSI (non-Unicode) string. This is easy: just write AnsiString(ADestinataire). Then you cast that to a PAnsiChar:
Res := MAPIResolveName(Session, Application.Handle,
PAnsiChar(AnsiString(ADestinataire)), MAPI_LOGON_UI, 0, PRecip);
Starting with Delphi 2009, the string data type is now implemented as a Unicode string. Previous versions implemented the string data type as an Ansi string (i.e. one byte per character).
This had significant implications when we ported our apps from 2007 to XE, and these articles were very helpful:
Delphi in a Unicode World (first of three parts)
Delphi and Unicode
Delphi Unicode Migration for Mere Mortals
To step back a moment:
When using   use the cast
==========   ============
String       PChar
AnsiString   PAnsiChar
WideString   PWideChar
String used to be an alias for AnsiString, which means it happened to work if you used PAnsiChar (even though you should have used PChar).
Now string is an alias for UnicodeString, which means using PAnsiChar (which was always wrong) is now really wrong.
Knowing this, you can see the mistake in the question:
ADestinataire: String
PAnsiChar(ADestinataire)

Convert char array to UNICODE in MFC C++

I'm using the following code to read files from a folder in Windows. However, since this is an MFC application, I have to convert the char array to UNICODE. For example, if I hard-code the path as "C:\images3\test\" as shown below, the code works.
WIN32_FIND_DATA FindFileData;
HANDLE hFind = INVALID_HANDLE_VALUE;
hFind = FindFirstFile(_T("C:\\images3\\test\\"), &FindFileData);
What I want is to get this working as follows:
char* pathOfFileType;
hFind = FindFirstFile(_T(pathOfFileType), &FindFileData);
Can anyone tell me how to fix this problem ?
Thanks
Thanks a lot for all your responses. I learned a lot from those answers, because I also didn't have much idea about what was happening underneath. Meanwhile, I managed to get rid of the issue by simply converting to UNICODE using the following code, with minimal changes to my existing code.
#include <atlconv.h>
USES_CONVERSION;
//An ANSI string
LPSTR lpsz_ANSI_String = pathOfFileType;
//ANSI string being converted to a UNICODE string
LPWSTR lpUnicodeStr = A2W( lpsz_ANSI_String );
hFind = FindFirstFile(lpUnicodeStr, &FindFileData);
You can use the MultiByteToWideChar function to convert a string from chars to UTF-16, but you'd be better off getting pathOfFileType directly in Unicode from the user or from wherever you take it; otherwise you may still experience problems with paths that contain characters not included in the current code page.
Your question demonstrates a confusion of several issues. First, using MFC doesn't mean you have to convert the character array to Unicode; one has nothing to do with the other. Furthermore, FindFirstFile is a Win32 API, not an MFC function. Finally, _T("abc") is not necessarily Unicode; rather, _T(X) is a macro that in multi-byte builds expands to X, and in Unicode builds expands to L X, creating a wide-character literal. This is designed so that your code can compile in a Unicode or multi-byte configuration. To achieve the same flexibility when declaring a variable, you use the TCHAR type instead of char or wchar_t. So your second snippet should look like
TCHAR* pathOfFileType;
hFind = FindFirstFile(pathOfFileType, &FindFileData);
Note no _T macro, that is only applied to string literals, not identifiers.
"since this a MFC application I have to convert the char array to UNICODE"
Not so. If you wish, you can use change to use the Multi-Byte Character Set.
In project properties, general change character set to 'Use Multi-Byte Character Set'
Now this will work
char* pathOfFileType;
hFind = FindFirstFile(pathOfFileType, &FindFileData);
Supposing you want to keep using UNICODE (Visual Studio's name for the 2-byte encoding of Unicode characters native to Windows), then you have to explicitly call the MBCS ("A") version of the API:
char* pathOfFileType;
hFind = FindFirstFileA(pathOfFileType, &FindFileData);
