Is there a difference between datatypes on different bit-size OSes?

I have a C program that I know works on 32-bit systems. On 64-bit systems (at least mine) it works up to a point and then stops. From reading some forums, it sounds like the program may not be 64-bit safe. I assume it has to do with differences in data types between 32-bit and 64-bit systems.
Is a char the same on both? What about int, long, or their unsigned variants? Is there any other way a 32-bit program wouldn't be 64-bit safe? If I wanted to verify that the application is 64-bit safe, what steps should I take?

Regular data types in C have minimum ranges of values rather than specific bit widths. For example, a short has to be able to represent, at a minimum, -32767 through 32767 inclusive.
So, yes, if your code depends on values wrapping around at 32768, it's unlikely to behave well if the short is some big honking 128-bit behemoth.
If you want specific-width data types, look into stdint.h for things like int64_t and so on. There is a wide variety to choose from: exact widths, "at-least" widths, and so on. The standard also mandates two's complement for the exact-width types, unlike the "regular" integral types:
integer types having certain exact widths;
integer types having at least certain specified widths;
fastest integer types having at least certain specified widths;
integer types wide enough to hold pointers to objects;
integer types having greatest width.
For example, from C11 7.20.1.1 Exact-width integer types:
The typedef name intN_t designates a signed integer type with width N, no padding
bits, and a two’s complement representation. Thus, int8_t denotes such a signed
integer type with a width of exactly 8 bits.
Provided you have followed the rules (things like not casting pointers to integers), your code should compile and run on any implementation, and any architecture.
If it doesn't, you'll just have to start debugging, then post the detailed information and code that seems to be causing the problem on a forum site dedicated to such things. Now where have I seen one of those recently? :-)

Related

XSD: What is the difference between xs:integer and xs:int?

I have started to create an XSD and found both xs:integer and xs:int in a couple of examples.
What is the difference between xs:integer and xs:int?
When should I use xs:integer?
When should I use xs:int?
The difference is the following:
xs:int is a signed 32-bit integer.
xs:integer is an unbounded integer value.
For details, see https://web.archive.org/web/20151117073716/http://www.w3schools.com/schema/schema_dtypes_numeric.asp
For example, XJC (Java) generates Integer for xs:int and BigInteger for xs:integer.
The bottom line: use xs:int if you want to work cross-platform and be sure that your numbers will pass without a problem.
If you want bigger numbers, use xs:long instead of xs:integer (it will be generated as Long).
The xs:integer type is a restriction of xs:decimal, with the fractionDigits facet set to zero and with a lexical space which forbids the decimal point and trailing zeroes which would otherwise be legal. It has no minimum or maximum value, though implementations running in machines of finite size are not required to be able to accept arbitrarily large or small values. (They are required to support values with 16 decimal digits.)
The xs:int type is a restriction of xs:long, with the maxInclusive facet set to 2147483647 and the minInclusive facet to -2147483648. (As you can see, it will fit conveniently into a two's-complement 32-bit signed-integer field; xs:long fits in a 64-bit signed-integer field.)
The usual rule is: use the one that matches what you want to say. If the constraint on an element or attribute is that its value must be an integer, xs:integer says that concisely. If the constraint is that the value must be an integer that can be expressed with at most 32 bits in two's-complement representation, use xs:int. (A secondary but sometimes important concern is whether your tool chain works better with one than with the other. For data that will live longer than your tool chain, it's wise to listen to the data first; for data that exists solely to feed the tool chain, and which will be of no interest if you change your tool chain, there's no reason not to listen to the tool chain.)
I would just add a note of pedantry that may be important to some people: it's not correct to say that xs:int "is" a signed 32-bit integer. That form of words implies an implementation in memory (or registers, etc) within a binary digital computer. XML is character-based and would implement the maximum 32-bit signed value as "2147483647" (my quotes, of course), which is a lot more than 32 bits! What IS true is that xs:int is (indirectly) a restriction of xs:integer which sets the maximum and minimum allowed values to be the same as the corresponding implementation-imposed limits of a 32-bit integer with a sign bit.

`Integer` vs `Int64` vs `Word64`

I have some data which can be represented by an unsigned Integral type and its biggest value requires 52 bits. AFAIK only Integer, Int64 and Word64 satisfy these requirements.
All the information I could find about those types was that Integer is signed and has an unbounded bit size, while Int64 and Word64 are fixed-size, signed and unsigned respectively. What I couldn't find was information on the actual implementation of those types:
How many bits will a 52-bit value actually occupy if stored as an Integer?
Am I correct that Int64 and Word64 allow you to store a 64-bit data and weigh exactly 64 bits for any value?
Are any of those types more performant or preferable for reasons other than size, e.g. native code implementations or optimizations tied to specific processor instructions?
And just in case: which one would you recommend for storing a 52-bit value in an application extremely sensitive in terms of performance?
How many bits will a 52-bit value actually occupy if stored as an Integer?
This is implementation-dependent. With GHC, values that fit inside a machine word are stored directly in a constructor of Integer, so if you're on a 64-bit machine, it should take the same amount of space as an Int. This corresponds to the S# constructor of Integer:
data Integer = S# Int#
             | J# Int# ByteArray#
Larger values (i.e. those represented with J#) are stored with GMP.
Am I correct that Int64 and Word64 allow you to store a 64-bit data and weigh exactly 64 bits for any value?
Not quite — they're boxed. An Int64 is actually a pointer to either an unevaluated thunk or a heap object consisting of a one-word info-table pointer plus the 64-bit integer value. (See the GHC commentary for more information.)
If you really want something that's guaranteed to be 64 bits, no exceptions, then you can use an unboxed type like Int64#, but I would strongly recommend profiling first; unboxed values are quite painful to use. For instance, you can't use unboxed types as arguments to type constructors, so you can't have a list of Int64#s. You also have to use operations specific to unboxed integers. And, of course, all of this is extremely GHC-specific.
If you're looking to store a lot of 52-bit integers, you might want to use vector or repa (built on vector, with fancy things like automatic parallelism); they store the values unboxed under the hood, but let you work with them in boxed form. (Of course, each individual value you take out will be boxed.)
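For a concrete picture, here is a minimal sketch of that approach, assuming the vector package is available; the values live unboxed inside the vector, and only the elements you take out get boxed:
import Data.Int (Int64)
import qualified Data.Vector.Unboxed as VU

-- One flat, unboxed array of 64-bit values; no per-element boxing inside the vector.
values :: VU.Vector Int64
values = VU.generate 1000000 fromIntegral

-- Bulk operations work directly on the unboxed representation.
total :: Int64
total = VU.sum (VU.map (* 2) values)

main :: IO ()
main = print total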
Are any of those types more performant or preferrable for any other reasons than size, e.g. native code implementations or direct processor instructions-related optimizations?
Yes; using Integer incurs a branch for every operation, since it has to distinguish the machine-word and bignum cases; and, of course, it has to handle overflow. Fixed-size integral types avoid this overhead.
And just in case: which one would you recommend for storing a 52-bit value in an application extremely sensitive in terms of performance?
If you're using a 64-bit machine: Int64 or, if you must, Int64#.
If you're using a 32-bit machine: Probably Integer, since on 32-bit machines Int64 is emulated with FFI calls to GHC functions that are probably not very highly optimised, but I'd try both and benchmark it. With Integer, you'll get the best performance on small integers, and GMP is heavily optimised, so it'll probably do better on the larger ones than you might think.
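If it helps, here is a rough benchmarking sketch along those lines, assuming the criterion package (any benchmarking library would do); it compares summing the same values as Int64 and as Integer:
import Criterion.Main (bench, defaultMain, whnf)
import Data.Int (Int64)
import Data.List (foldl')

main :: IO ()
main = defaultMain
  [ bench "sum as Int64"   $ whnf (foldl' (+) 0) int64s
  , bench "sum as Integer" $ whnf (foldl' (+) 0) integers
  ]
  where
    xs       = [1 .. 100000] :: [Int]
    int64s   = map fromIntegral xs :: [Int64]
    integers = map fromIntegral xs :: [Integer]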
You could select between Int64 and Integer at compile-time using the C preprocessor (enabled with {-# LANGUAGE CPP #-}); I think it would be easy to get Cabal to control a #define based on the word width of the target architecture. Beware, of course, that they are not the same; you will have to be careful to avoid "overflows" in the Integer code, and e.g. Int64 is an instance of Bounded but Integer is not. It might be simplest to just target a single word width (and thus type) for performance and live with the slower performance on the other.
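A minimal sketch of that CPP approach; USE_INT64 is a hypothetical macro that a Cabal flag would define (e.g. via cpp-options: -DUSE_INT64) when building for a 64-bit word size:
{-# LANGUAGE CPP #-}
module FastInt (FastInt) where

-- USE_INT64 is a hypothetical, Cabal-controlled macro; it is not defined by GHC itself.
#ifdef USE_INT64
import Data.Int (Int64)

-- 64-bit builds: fixed-width type for speed (note: Bounded, overflows silently).
type FastInt = Int64
#else
-- 32-bit builds: fall back to Integer (not Bounded, never overflows).
type FastInt = Integer
#endif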
I would suggest creating your own Int52 type as a newtype wrapper over Int64, or a Word52 wrapper over Word64; just pick whichever matches your data better, as there should be no performance impact. If it's just arbitrary bits, I'd go with Int64, simply because Int is more common than Word.
You can define all the instances to handle wrapping automatically (try :info Int64 in GHCi to find out which instances you'll want to define), and provide "unsafe" operations that just apply directly under the newtype for performance-critical situations where you know there won't be any overflow.
Then, if you don't export the newtype constructor, you can always swap the implementation of Int52 later, without changing any of the rest of your code. Don't worry about the overhead of a separate type — the runtime representation of a newtype is completely identical to the underlying type; they only exist at compile-time.
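A minimal sketch of that newtype idea; the names Int52, mkInt52 and unsafeAdd are made up for illustration, and wrapping to 52 bits is done with a shift-based sign extension:
module Int52 (Int52, mkInt52, fromInt52, unsafeAdd) where

import Data.Bits (shiftL, shiftR)
import Data.Int (Int64)

-- The constructor is deliberately not exported, so the representation can be swapped later.
newtype Int52 = Int52 Int64
  deriving (Eq, Ord, Show)

-- Wrap an Int64 into the signed 52-bit range by sign-extending bit 51
-- (shiftR on Int64 is an arithmetic shift, so the sign is preserved).
mkInt52 :: Int64 -> Int52
mkInt52 x = Int52 ((x `shiftL` 12) `shiftR` 12)

fromInt52 :: Int52 -> Int64
fromInt52 (Int52 x) = x

-- Arithmetic that wraps at 52 bits, mirroring the behaviour of fixed-width types.
instance Num Int52 where
  Int52 a + Int52 b = mkInt52 (a + b)
  Int52 a - Int52 b = mkInt52 (a - b)
  Int52 a * Int52 b = mkInt52 (a * b)
  negate (Int52 a)  = mkInt52 (negate a)
  abs    (Int52 a)  = mkInt52 (abs a)
  signum (Int52 a)  = Int52 (signum a)
  fromInteger       = mkInt52 . fromInteger

-- An "unsafe" operation for hot paths where overflow is known not to happen.
unsafeAdd :: Int52 -> Int52 -> Int52
unsafeAdd (Int52 a) (Int52 b) = Int52 (a + b)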

Bit Size of GHC's Int Type

Why is GHC's Int type not guaranteed to use exactly 32 bits of precision? This document claims it has at least 30-bit signed precision. Is it somehow related to fitting Maybe Int or similar into 32 bits?
It is to allow implementations of Haskell that use tagging. When using tagging you need a few bits as tags (at least one, two is better). I'm not sure there currently are any such implementations, but I seem to remember Yale Haskell used it.
Tagging can somewhat avoid the disadvantages of boxing, since you no longer have to box everything; instead the tag bit will tell you if it's evaluated etc.
The Haskell language definition states that the type Int covers at least the range [-2^29, 2^29 - 1].
There are other compilers/interpreters that use this property to boost the execution time of the resulting program.
All internal references to (aligned) Haskell data point to memory addresses that are multiples of 4 (8) on 32-bit (64-bit) systems. So, references need only 30 bits (61 bits) and therefore leave 2 (3) bits for "pointer tagging".
In the case of data, GHC uses those tags to store information about the referenced data, i.e. whether that value is already evaluated and, if so, which constructor it has.
In the case of 30-bit Ints (so, not GHC), you could use one bit to decide whether it is a pointer to an unevaluated Int or the Int itself.
Pointer tagging could also be used for one-bit reference counting, which can speed up the garbage collection process. That can be useful in cases where a direct one-to-one producer-consumer relationship was created at runtime: it would result in direct memory reuse instead of feeding the garbage collector.
So, using 2 bits for pointer tagging, there could be some wild combinations of intense optimisation...
In case of Ints I could imagine these 4 tags:
a singular reference to an unevaluated Int
one of many references to the same possibly still unevaluated Int
30 bits of that Int itself
a reference (one of possibly many) to an evaluated 32-bit Int.
I think this is because of early ways to implement GC and all that stuff. If you have 32 bits available and you only need 30, you could use those two spare bits to implement interesting things, for instance using a zero in the least significant bit to denote a value and a one for a pointer.
Today the implementations don't use those bits, so an Int has at least 32 bits on GHC. (That's not entirely true; IIRC one can set some flags to get 30- or 31-bit Ints.)
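For what it's worth, you can check what a given GHC actually gives you with a small sketch like this (finiteBitSize comes from Data.Bits in base):
import Data.Bits (finiteBitSize)

main :: IO ()
main = do
  -- On current GHC this prints 64 on a 64-bit machine (32 on a 32-bit one),
  -- even though the language report only guarantees the range [-2^29, 2^29 - 1].
  print (finiteBitSize (0 :: Int))
  print (minBound :: Int, maxBound :: Int)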

Why are there so many string types in MFC?

LPTSTR* arrpsz = new LPTSTR[ m_iNumColumns ];
arrpsz[ 0 ] = new TCHAR[ lstrlen( pszText ) + 1 ];
(void)lstrcpy( arrpsz[ 0 ], pszText );
This is a code snippet about strings in MFC, and there are also things like _T("HELLO"). Why are there so many string types in MFC? What are they used for?
Strictly speaking, what you're showing here are Windows-specific strings, not MFC string types (but your point is even better taken if you add in CString and std::string). It's more complex than it needs to be -- largely for historical reasons.
tchar.h is definitely worth looking at -- also search for TCHAR on MSDN.
There's an old joke about string processing in C that you may find amusing: string handling in C is so efficient because there's no string type.
Historical reasons.
The original Windows APIs were in C (unless the real originals were in Pascal and have been lost in the mists). Microsoft created its own datatypes to represent C datatypes, likely because C datatypes are not standardized in their size. (For C integral types, char is at least 8 bits, short is at least 16 bits and at least as big as a char, int is at least 16 bits and at least as big as a short, and long is at least 32 bits and at least as big as an int.) Since Windows ran first on essentially 16-bit systems and later 32-bit, C compilers didn't necessarily agree on sizes. Microsoft further designated more complex types, so (if I've got this right) a C char * would be referred to as an LPSTR and a const char * as an LPCSTR.
Thing is, an 8-bit character is not suitable for Unicode, as UTF-8 is not easy to retrofit into C or C++. Therefore, they needed a wide character type, which in C would be referred to as wchar_t, but which got a set of Microsoft datatypes corresponding to the earlier ones. Furthermore, since people might want to compile sometimes in Unicode and sometimes in ASCII, they made the TCHAR character type, and corresponding string types, which would be based on either char (for ASCII compilation) or wchar_t (for Unicode).
Then came MFC and C++ (sigh of relief) and Microsoft wanted a string type. Since this was before the standardization of C++, there was no std::string, so they invented CString. (They also had container classes that weren't compatible with what came to be the STL and then the containers part of the library.)
Like any mature and heavily used application or API, there's a whole lot in it that would be done completely differently if it were possible to do it over from scratch.
See Generic-Text Mappings in TCHAR.H and the description of LPTSTR in Windows Data Types.
