Unfathomable problem with unicode and frameworks

Unfathomable problem with unicode and frameworks - visual-c++

I'm experiencing a very strange problem... The following trivial test code works as it should if it is injected in a single Cocoa application, but when I use it in one of my frameworks, I get absolutely unexpected results...
wchar_t Buf[2048];
wcscpy(Buf, L"/zbxbxklbvasyfiogkhgfdbxbx/bxkfiorjhsdfohdf/xbxasdoipppwejngfd/gjfdhjgfdfdjkg.sdfsdsrtlrt.ljlg/fghlfg");
int len1 = wcslen(L"/zbxbxklbvasyfiogkhgfdbxbx/bxkfiorjhsdfohdf/xbxasdoipppwejngfd/gjfdhjgfdfdjkg.sdfsdsrtlrt.ljlg/fghlfg");
int len2 = wcslen(Buf);
char Buf2[2048];
Buf2[0]=0;
wcstombs(Buf2, Buf, 2048);
// ??? Buf2 == ""
// ??? len1 == len2 == 57, but should be 101
How can this be, have I gone mad? Even if there was a memory corruption, it couldn't possibly corrupt all these values allocated on stack... Why won't even the wcslen(L"MyWideString") work? Changing test string changes its length, but it is always wrong, wcstombs returns -1...
setlocale() is not used anywhere, test string contains only ASCII characters, in order to ease porting I use the -fshort-wchar compiler option, but it works fine in case of a test Cocoa application...
Please help!

I've just tested this again with GCC 4.6. In the standard settings, this works as expected, giving 101 for all the lengths. However, with your option -fshort-wchar I also get unexpected results (51 in my case, and 251 for the final conversion after using setlocale()).
So I looked up the man entry for the option:
Warning: the -fshort-wchar switch causes GCC to generate code that is not binary compatible with code generated without that switch. Use it to conform to a non-default application binary interface.
I think that explains it: When you're linking to the standard library, you are expected to use the correct ABI and type conventions, which you are overriding with that option.

Wide char implementation in C/C++ can be anything, including 1 byte, 2 bytes or 4 bytes. This depends on the compiler and the platform you are compiling to.
Probably wikipedia is not the best place to quote from but in this case:
http://en.wikipedia.org/wiki/Wide_character states that
... width of wchar_t is compiler-specific and can be as small as 8 bits.
and
... wide characters should be 16-bit values under C90 due to historical compatibility reasons. C and C++ compilers that comply with the 10646-1:2000 Unicode standard generally assume 32-bit values....
So, do not assume and use the sizeof(wchar_t).

-fshort-wchar change the compiler's ABI, so you need to recompile glibc, libgcc and all library using wchar_t. Otherwise, wcslen and other functions in glibc are still assume wchar_t is 4 bytes.
see: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42092

Related

"getenv... function ... may be unsafe" - really?

I'm using MSVC to compile some C code which uses standard-library functions, such as getenv(), sprintf and others, with /W3 set for warnings. I'm told by MSVC that:
'getenv': This function or variable may be unsafe. Consider using _dupenv_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS
Questions:
Why would this be unsafe, theoretically - as opposed to its use on other platforms?
Is it unsafe on Windows in practice?
Assuming I'm not writing security-oriented code - should I disable this warning or actually start aliasing a bunch of standard library functions?

getenv() is potentially unsafe in that subsequent calls to that same function may invalidate earlier returned pointers. As a result, usage such as
char *a = getenv("A");
char *b = getenv("B");
/* do stuff with both a and b */
may break, because there's no guarantee a is still usable at that point.
getenv_s() - available in the C standard library since C11 - avoids this by immediately copying the value into a caller-supplied buffer, where the caller has full control over the buffer's lifetime. dupenv_s() avoids this by making the caller responsible for managing the lifetime of the allocated buffer.
However, the signature for getenv_s is somewhat controvertial, and the function may even be removed from the C standard at some point... see this report.

getenv suffers like much of the classic C Standard Library by not bounding the string buffer length. This is where security bugs like buffer overrun often originate from.
If you look at getenv_s you'll see it provides an explicit bound on the length of the returned string. It's recommended for all coding by the Security Development Lifecycle best practice, which is why Visual C++ emits deprecation warnings for the less secure versions.
See MSDN and this blog post
There was an effort by Microsoft to get the C/C++ ISO Standard Library to include the Secure CRT here, some of which was approved for C11 Annex K as noted here. That also means that getenv_s should be part of the C++17 Standard Library by reference. That said, Annex K is officially considered optional for conformance. The _s bounds-checking versions of these functions are also still a subject of some debate in the C/C++ community.

Discover if structure initialization doesn't modify all members

Consider the following code:
typedef struct _sMYSTRUCT_BASE
{
int b_a;
int b_b;
int b_c;
} sMYSTRUCT_BASE;
typedef struct _sMYSTRUCT
{
sMYSTRUCT_BASE base;
int a;
int b;
} sMYSTRUCT;
Private const sMYSTRUCT mystruct_init =
{
0,
1,
3,
4
};
I am looking for a way to generate an error (compile-, or runtime) to indicated that the structure initialization hasn't explicitly 'touched' all structure members.
There are 5 integers in the structure, but 'mystruct_init' only have 4 values.
I know that last member (mystruct_init.b) will be zero, but I need some kind of warning/error to inform the programmer about the mistake.
This has to work on a very old compiler (maybe not even ansi-c compliant).

Modern compilers are capable of producing such a warning...in gcc, it's turned on with -Wmissing-field-initializers (which warns about initializers that exist but do not initialize all members, but not about structs with no initializer expression; these can at least sometimes be caught by turning on -Wuninitialized, which will warn you if it sees you reading a potentially uninitialized value, at least if you read it in the same function the variable was declared in).
If your very old compiler happens to supply such a warning, you could of course just turn it on, but that seems unlikely from your description.
Your best option, I think, if you want to do an exhaustive search for them, would be to see whether you can get the code to compile with some version of gcc -- it wouldn't have to compile well enough to actually run on your target platform in order to get the warnings. I can't guarantee that it will be able to compile your pre-ANSI C code, particularly if it widely uses compiler-specific extensions, but I can at least say that support for the legacy K&R syntax is still present in the modern C standard, so I wouldn't be surprised if your code compiles better than you might think.
If that works, then to consistently produce the warnings in your IDE, you could modify the build script so that it both compiles and links the code with the real compiler you're targeting, and also compiles it (but not necessarily links it) with gcc, just to generate additional warnings that can be picked up and displayed by the IDE.
The other option would be to see if you can find a compatible static analyzer that can perform such a check; I work on a tool called EnSoft Atlas that builds a data-flow graph which, together with a simple script, could be used to enforce initialization more thoroughly than the gcc warnings allow, by checking whether flow of uninitialized values to fields of structs occurs.
However, our support for C is still in beta. Atlas requires that Eclipse CDT (or JDT for Java) be able to parse your code, and the current C beta only fully supports modern strongly-typed struct initializers (i.e. struct foo f = (struct foo) {...} has fully connected data-flow, but support for the older initializer list syntax struct foo f = {...} was not implemented in our first pass), so I'm not sure it would be able to meet your needs at this time.

Is there a workaround for _Complex syntax in VC++?

I've got a library compiled with MinGW which supports the C99 Keywords, _Complex. I'd like to use this library with MSVC++ 2010 compiler. I've tried to temporarily switch off all the _Complex syntax code, so that it compiles. I found most of the other functions worked fine in MSVC++. Now I want to enable the parts with _Complex definition, but really don't know how to.
Obviously I can't recompile it in MSVC++ as the library asks for C99 features, etc. However, I feel like it is such a waste to give it up, and look for substutions, because it works perfect with most other functions.
I think I can write wrappers of the APIs that require _Complex syntax and compile it with MinGW GCC then it will be able to import into my MSVC project. But I still want to know if there is any better workaround of this problem, like what is the "standard" way people dealing with problem when compile C99 complex number syntax in VC++?
Xing.

From the C Standard (C11 §6.2.5 ¶13; C99 has approximately the same language):
Each complex type has the same representation and alignment requirements as an array
type containing exactly two elements of the corresponding real type; the first element is
equal to the real part, and the second element to the imaginary part, of the complex
number.
I don’t have the C++ Standard in front of me, but the complex type templates defined in <complex> have the same requirement; this is intended for compatibility.
You can therefore re-write C functions taking & returning values of type double _Complex as C++ functions taking & returning values of type std::complex<double>; so long as name-mangling on the C++ side has been turned off (via extern "C") both sides will be compatible.
Something like this might help:
#ifdef __cplusplus
#include <complex>
#define std_complex(T) std::complex<T>
#else
#define std_complex(T) T _Complex
#endif

Why is /Wp64 deprecated?

Why is the /Wp64 flag in Visual C++ deprecated?
cl : Command line warning D9035 :
option 'Wp64' has been deprecated and will be removed in a future release

I think that/Wp64 is deprecated mainly because compiling for a 64-bit target will catch the kinds of errors it was designed to catch (/Wp64 is only valid in 32-bit compiles). The option was added back when 64-bit targets were emerging to help people migrate their programs to 64-bit and help detect code that wasn't '64-bit clean'.
Here's an example of the kinds of problems with /Wp64 that Microsoft just isn't interested in fixing - probably rightly so (from http://connect.microsoft.com/VisualStudio/feedback/details/502281/std-vector-incompatible-with-wp64-compiler-option):
Actually, the STL isn't intentionally incompatible with /Wp64, nor is
it completely and unconditionally incompatible with /Wp64. The
underlying problem is that /Wp64 interacts extremely badly with
templates, because __w64 isn't fully integrated into the type system.
Therefore, if vector<unsigned int> is instantiated before vector<__w64 unsigned int>, then both of them will behave like vector<unsigned int>, and vice versa. On x86, SOCKET is a typedef for __w64 unsigned int. It's not obvious, but vector<unsigned int> is being instantiated
before your vector<SOCKET>, since vector<bool> is backed (in our
implementation) by vector<unsigned int>.
Previously (in VC9 and earlier), this bad interaction between /Wp64
and templates caused spurious warnings. In VC10, however, changes to
the STL have made this worse. Now, when vector::push_back() is given
an element of the vector itself, it figures out the element's index
before doing other work. That index is obtained by subtracting the
element's address from the beginning of the vector. In your repro,
this involves subtracting const SOCKET * - unsigned int *. (The latter
is unsigned int * and not SOCKET * due to the previously described
bug.) This /should/ trigger a spurious warning, saying "I'm
subtracting pointers that point to the same type on x86, but to
different types on x64". However, there is a SECOND bug here, where
/Wp64 gets really confused and thinks this is a hard error (while
adding constness to the unsigned int *).
We agree that this bogus error message is confusing. However, since
it's preceded by an un-silenceable command line deprecation warning
D9035, we believe that that should be sufficient. D9035 already says
that /Wp64 shouldn't be used (although it doesn't go on to say "this
option is super duper buggy, and completely unnecessary now").
In the STL, we could #error when /Wp64 is used. However, that would
break customers who are still compiling with /Wp64 (despite the
deprecation warning) and aren't triggering this bogus error. The STL
could also emit a warning, but the compiler is already emitting D9035.

/Wp64 on 32-bit builds is a waste of time. It is deprecated, and this deprecation makes sense. The way /Wp64 worked on 32-bit builds is it would look for a _w64 annotation on a type. This _w64 annotation would tell the compiler that, even though this type is 32-bits in 32-bit mode, it is 64-bits in 64-bit mode. This turned out to be really flakey, especially where templates are involved.
/Wp64 on 64-bit builds is extremely useful. The documentation (http://msdn.microsoft.com/en-us/library/vstudio/yt4xw8fh.aspx) claims that it is on by default in 64-bit builds, but this is not true. Compiler warnings C4311 and C4312 are only emitted if /Wp64 is explicitly set. Those two warnings indicate when a 32-bit value is put into a pointer, or vice-versa. These are very important for code correctness, and claim to be at warning level 1. I have found bugs in very widespread code that would have been stopped if the developers had turned on /Wp64 for 64-bit builds. Unfortunately, you also get the command line warning that you have observed. I know of no way to squelch this warning, and I have learned to live with it. On the bright side, if you build with as warnings as errors, this command line warning doesn't turn into an error.

Because when using the 64 Bit compiler from VS2010 the compiler does the detection of 64 bit problems automatically... this switch is from back in the day when you could try to detect 64 Bit problem running the 32 Bit compiler...
See http://msdn.microsoft.com/en-us/library/yt4xw8fh%28v=VS.100%29.aspx

You could link to the deprecation warning, but couldn't go to the /Wp64 documentation?
By default, the /Wp64 compiler option is off in the Visual C++ 32-bit compiler and on in the Visual C++ 64-bit compiler.
If you regularly compile your application by using a 64-bit compiler, you can just disable /Wp64 in your 32-bit compilations because the 64-bit compiler will detect all issues.
Emphasis added

Why are there so many string types in MFC?

LPTSTR* arrpsz = new LPTSTR[ m_iNumColumns ];
arrpsz[ 0 ] = new TCHAR[ lstrlen( pszText ) + 1 ];
(void)lstrcpy( arrpsz[ 0 ], pszText );
This is a code snippet about String in MFC and there are also _T("HELLO"). Why are there so many String types in MFC? What are they used for?

Strictly speaking, what you're showing here are windows specific strings, not MFC String types (but your point is even better taken if you add in CString and std::string). It's more complex than it needs to be -- largely for historical reasons.
tchar.h is definitely worth looking at -- also search for TCHAR on MSDN.
There's an old joke about string processing in C that you may find amusing: string handling in C is so efficient because there's no string type.

Historical reasons.
The original windows APIs were in C (unless the real originals were in Pascal and have been lost in the mists). Microsoft created its own datatypes to represent C datatypes, likely because C datatypes are not standard in their size. (For C integral types, char is at least 8 bits, short is at least 16 bits and at least as big as a char, int is at least 16 bits and at least as big as a short, and long is at least 32 bits and at least as big as an int.) Since Windows ran first on essentially 16-bit systems and later 32-bit, C compilers didn't necessarily agree on sizes. Microsoft further designated more complex types, so (if I've got this right) a C char * would be referred to as a LPCSTR.
Thing is, an 8-bit character is not suitable for Unicode, as UTF-8 is not easy to retrofit into C or C++. Therefore, they needed a wide character type, which in C would be referred to as wchar_t, but which got a set of Microsoft datatypes corresponding to the earlier ones. Furthermore, since people might want to compile sometimes in Unicode and sometimes in ASCII, they made the TCHAR character type, and corresponding string types, which would be based on either char (for ASCII compilation) or wchar_t (for Unicode).
Then came MFC and C++ (sigh of relief) and Microsoft wanted a string type. Since this was before the standardization of C++, there was no std::string, so they invented CString. (They also had container classes that weren't compatible with what came to be the STL and then the containers part of the library.)
Like any mature and heavily used application or API, there's a whole lot in it that would be done completely differently if it were possible to do it over from scratch.

See Generic-Text Mappings in TCHAR.H and the description of LPTSTR in Windows Data Types.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string