Unit conversion errors for value objects in ddd - domain-driven-design

As you might know, DDD literature suggests that we should treat "numeric quantities with some unit" as value objects, not as primitive types (ints, BigDecimal). Some examples of such value objects are money, distance, or file size. I agree with the big picture.
However, there is something I cannot understand: conversion errors when representing something in one unit, converting it to another unit, and back. This process might lose some information. Take file size, for example. Let's say I have a file whose size is 3.67 MB, and I convert that to another instance of FileSize whose unit is GB by dividing 3.67 by 1024. Now I have a FileSize of (approximately) 0.00358398437 GB. If I now try to convert it back to MB, the result is not 3.67 MB. If, however, I don't use a value object but instead only use the primitive "sizeInBytes" (long), I cannot lose information to conversion errors.
I must have missed something. Is my example just plain stupid? Or is it acceptable to lose some information when converting from one unit to another? Or should FileSize always also carry the exact file size in bytes (along with the approximate size in the given unit)?
Thanks in advance!

What you are describing is more an implementation problem of your concrete example than a problem with the approach. The idea of using value objects to represent amounts with a unit is to avoid mistakes like adding Liters to Kilometers or doing 10cm + 10Km = 20cm. Value objects, when developed correctly, will enforce that the operations are done correctly between different units.
Now, how you implement these value objects with your programming language, is a different problem. But for your concrete example, I would say that the value object will internally have a long field with the size in Bytes, no matter what unit you use to initialize the object. In this case, the unit will be used to convert the initialization value to the right amount of bytes and also for display purposes, but when you have to add 2 FileSizes, you can add the internal amounts in bytes.
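A minimal sketch of that idea, assuming binary multiples (1 GB = 1024 MB, as in the question) and assuming that fractional bytes are truncated on construction. The class name FileSize comes from the question; the methods of, in_unit and the field size_in_bytes are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal

# Binary multiples, as in the question (1 GB = 1024 MB).
_UNIT_BYTES = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


@dataclass(frozen=True)
class FileSize:
    """Hypothetical value object: the byte count is the only stored state."""

    size_in_bytes: int

    @classmethod
    def of(cls, amount: str, unit: str) -> "FileSize":
        # The unit is used once, to turn the input into whole bytes
        # (assumption: any fractional byte is truncated).
        return cls(int(Decimal(amount) * _UNIT_BYTES[unit]))

    def __add__(self, other: "FileSize") -> "FileSize":
        # Arithmetic happens on the exact internal byte counts.
        return FileSize(self.size_in_bytes + other.size_in_bytes)

    def in_unit(self, unit: str) -> Decimal:
        # Derived on demand for display; the stored byte count never changes.
        return Decimal(self.size_in_bytes) / _UNIT_BYTES[unit]
```

Because a view in GB or MB is computed from the byte count each time, asking for the size in GB and then in MB cannot drift: neither call modifies the object.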

we should treat "numeric quantities with some unit" as value objects, not as primitive types (ints, BigDecimal).
Yes, that's right. More generally, we're encouraged to encapsulate data structures (an integer alone is a trivially simple data structure) behind domain specific abstractions. This is one good way to leverage type checking - it gives the checker the hints that it needs to detect a category of dumb mistakes.
Namely conversion errors when representing something in one unit, converting it to another unit, and back. This process might lose some information.
That's right. More generally: rounding loses information.
I don't use a value object but instead only use the primitive "sizeInBytes" (long), I cannot lose information to conversion errors.
So look carefully at that: if you perform the same sequence of conversions you described using primitive data structures, you would end up with the same rounding error (after all, that's where the rounding error came from: the abstraction of the measurement defers the calculation to its internal general purpose representation).
The thing that saves you from the error is not discarding the original exact answer.
What domain modeling is telling you to do is make explicit which values are "exact" and which have "rounding errors".
(Note that in some domains, they aren't even "errors"; many domains have explicit rules about how rounding is supposed to happen. Sadly, they are rarely the rounding rules defined by IEEE-754, so you can't just lean on the general purpose floating point type.)
DDD will also encourage you to track precisely which values are for display/reporting, and which are to be used in later calculations.
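As a small illustration of keeping the exact value apart from a rounded display value, here is a sketch using Python's decimal module, where the rounding rule is named explicitly instead of being inherited from binary floating point. The value 2.665 and the variable names are made up for the example.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# The exact value is what later calculations should keep using.
exact = Decimal("2.665")

# Display values are derived from it with an explicitly chosen rule.
half_up = exact.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
half_even = exact.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)

print(exact, half_up, half_even)
```

The two rules give different two-decimal results for the same exact input (2.67 and 2.66), which is why the domain, not the general-purpose number type, has to say which rule applies.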

Reading this, I think you're misunderstanding what DDD is. The first D in DDD stands for Domain, and a domain is a sphere of knowledge. The way you represent a sphere of knowledge (a domain) is entirely based on the business domain you're attempting to represent, and will differ from one business domain to another.
So...
Domain A: Business User that has X amount of storage space
I upload X file
file X uses 3.67 MB
You have used 1% of your allocated space.
You have 97 MB space remaining
Domain B: Sys Admin - total space is Y amount of storage space
Users have uploaded 3.67 MB
That user has used 1% of their space
That user has 97 MB space remaining
There is 1000 GB total space remaining to allocate to all users / total space remaining.
In other words, the sys admin has one domain (the total disk), while the user has allocated space (a subset): they have different domains of knowledge about space.
Also note that DDD is really about sectioning a domain or sphere of knowledge into sub-sections for the specific users of a system, and not about the facts of a system. Facts are different from knowledge.
I hope this makes some sense!

Related

How do you approach creating a complete new datatype on the "bit-level"?

I would like to create a new data type in Rust on the "bit-level".
For example, a quadruple-precision float. I could create a structure that has two double-precision floats and arbitrarily increase the precision by splitting the quad into two doubles, but I don't want to do that (that's what I mean by on the "bit-level").
I thought about using a u8-array or a bool-array, but in both cases I waste 7 bits of memory per bit stored (because a bool is also one byte large). I know there are several crates that implement something like bit-arrays or bit-vectors, but looking through their source code didn't help me understand their implementation.
How would I create such a bit-array without wasting memory, and is this the way I would want to choose when implementing something like a quad-precision type?
I don't know how to implement new data types that don't use the basic types or are structures that combine the basic types, and I haven't been able to find a solution on the internet yet; maybe I'm not searching with the right keywords.
The question you are asking has no direct answer: Just like any other programming language, Rust has a basic set of rules for type layouts. This is due to the fact that (most) real-world CPUs can't address individual bits, need certain alignments when referencing memory, have rules regarding how pointer arithmetic works etc. etc.
For instance, if you create a type of just two bits, you'll still need an 8-bit byte to represent that type, because there is simply no way to address two individual bits on most CPU's opcodes; there is also no way to take the address of such a type because addressing works at least on the byte-level. More useful information regarding this can be found here, section 2, The Anatomy of a Type. Be aware that the non-wasting bit-level type you are thinking about needs to fulfill all the rules mentioned there.
It's a perfectly reasonable approach to represent what you want as, e.g., a single wrapped u128 and implement all arithmetic on top of that type. Another, more generic, approach would be to use a Vec<u8>. Either way, you'll always do a relatively large amount of bit-masking, indirection, and such.
Having a look at rust_decimal or similar crates might also be a good idea.
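To give a feel for the bit-masking involved, here is a rough sketch that packs the three fields of an IEEE 754 binary128 value (1 sign bit, 15 exponent bits, 112 fraction bits) into a single 128-bit word. It is written in Python only to show the shifts and masks; in Rust the same operations would apply to a u128 held inside a wrapper struct. The function names pack and unpack are hypothetical, and no arithmetic is implemented.

```python
# Field widths of IEEE 754 binary128 (quadruple precision).
EXP_BITS, FRAC_BITS = 15, 112
EXP_MASK = (1 << EXP_BITS) - 1
FRAC_MASK = (1 << FRAC_BITS) - 1
SIGN_SHIFT = EXP_BITS + FRAC_BITS  # bit 127


def pack(sign: int, exponent: int, fraction: int) -> int:
    """Pack the three fields into one 128-bit word with shifts and masks."""
    return (
        ((sign & 1) << SIGN_SHIFT)
        | ((exponent & EXP_MASK) << FRAC_BITS)
        | (fraction & FRAC_MASK)
    )


def unpack(word: int) -> tuple[int, int, int]:
    """Recover (sign, biased exponent, fraction) from a packed word."""
    return (
        (word >> SIGN_SHIFT) & 1,
        (word >> FRAC_BITS) & EXP_MASK,
        word & FRAC_MASK,
    )


# 1.0 in binary128: sign 0, biased exponent 16383, fraction 0.
one = pack(0, 16383, 0)
```

No bit is wasted here: all 128 bits carry a field, and the word as a whole is still byte-addressable, which sidesteps the addressing problem described above.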

How to use ApplicationDataTypes in C code

As I understand it, the ApplicationDataType was introduced in AUTOSAR version 4 to design Software-Components that are independent of the underlying platform and are therefore reusable in different projects and applications.
But how about the implementation behind such a SW-C to be platform independent?
Use-case example: You want to design and implement a SW-C that works as a FiFo. You have one Port for Input-Data, an internal buffer and one Port for Output-Data. You could implement this without knowing about the data type of the data by using the “abstract” ApplicationDataType.
By using an ApplicationDataType for a variable as part of a PortInterface sooner or later you have to map this ApplicationDataType to an ImplementationDataType for the RTE-Generator.
Finally, the code created by the RTE-Generator only uses the ImplementationDataType. The ApplicationDataType is nowhere to be found in the generated code.
Is this intended behavior or a bug of the RTE-Generator?
(Or maybe I'm missing something?)
It is intended that ApplicationDataTypes do not directly appear in code, they are represented by their ImplementationDataType counterparts.
The motivation for the definition of data types on different levels of abstraction is explained in the AUTOSAR specifications, namely the TPS Software Component Template.
You will never find an ApplicationDataType in the C code, because it's defined on a physical level with a physical unit and might have a (completely) different representation on the implementation level in C.
Imagine a battery control sensor that measures the voltage. The value can be in the range 0.0V to 14.0V with one digit after the decimal point (physical). You could map it to a float in C, but floating point operations are expensive. Instead, you use fixed point arithmetic where you map the physical value 0.0 to 0, 0.1 to 1, 0.2 to 2 and so on. This mapping is described by a so-called compuMethod.
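The mapping in this example can be sketched as a linear scaling with a resolution of 0.1 V per internal count. This only illustrates the arithmetic, not what an RTE generator emits; the function names and the range check are assumptions made for the sketch.

```python
COUNTS_PER_VOLT = 10  # assumed resolution: one internal count = 0.1 V
MAX_INTERNAL = 140    # 14.0 V, upper end of the physical range


def to_internal(volts: float) -> int:
    """Physical value (volts) -> fixed-point internal representation."""
    raw = round(volts * COUNTS_PER_VOLT)
    if not 0 <= raw <= MAX_INTERNAL:
        raise ValueError(f"{volts} V is outside 0.0 V .. 14.0 V")
    return raw


def to_physical(raw: int) -> float:
    """Internal representation -> physical value in volts."""
    return raw / COUNTS_PER_VOLT


print(to_internal(12.6), to_physical(126))
```

The internal values 0 to 140 fit in a single unsigned byte, which is the kind of saving the fixed-point representation is after.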
The software component will always use the internal representation. So, why do you need the ApplicationDataType then? There are many reasons to use them, some of them are:
Methodology: The software component designer doesn't need to worry about the implementation in C. Somebody else can define that in a later stage.
Measurement: If you measure the value, you have a well-defined compuMethod and know the physical interpretation of the value in C.
Data conversion: If you connect software components with different units, e.g. km/h vs mph, the Rte could automatically convert the internal representation between them.
Constant conversion: You can specify an initial value on the physical value (e.g. 10.6V) and the Rte will convert it to the internal representation.
Variable Size Arrays: Without dynamic memory allocation, you cannot have a variable size array in C. But you could reserve some (max) memory in an array and store the actual length in a separate field. On the implementation level you then have a struct with two members (value, length). But on the application level you just have an array.
from AUTOSAR_TPS_SoftwareComponentTemplate.pdf
ApplicationDataType defines a data type from the application point of
view. Especially it should be used whenever something "physical" is at
stake.
An ApplicationDataType represents a set of values as seen in the
application model, such as measurement units. It does not consider
implementation details such as bit-size, endianess, etc.
It should be possible to model the application level aspects of a VFB
system by using ApplicationDataTypes only.

Small subset of huge matrix-like structure from disk transparently

A simplified version of the question
I have a huge matrix-like dataset, that we for now can pretend is actually an n-by-n matrix stored on-disk as n^2 IEEE-754 doubles (see details below the line on how this is a simplification - it probably matters). The file is on the order of a gigabyte, but in a certain (pure) function I will only need on the order of n of the elements contained in it. Exactly which elements will be needed is complicated, and not something like a simple slice.
What are my options for decoupling reading the file from disk and the computation? Most of all, I'd like to treat the on-disk data as if it were in memory (I am of course ready to swear to all the gods of referential transparency that the data on disk will not change). I've looked at mmap and friends, but some cursory testing shows that these do not seem to free memory aggressively enough.
Do I have to couple my computations to IO if I need such fine-grained control of how much of the file is kept in memory?
A more honest description of the on-disk data
The data on disk isn't actually as simple as described. Something closer to the truth would be the following: A file begins with a 32 bit integer n. The following then occurs precisely n times: A 32 bit integer m_i > 0 (1 ≤ i ≤ n), followed by exactly m_i IEEE-754 doubles x_(i,1),…,x_(i, m_i). (So, this is a jagged two-dimensional array).
In practice, determining i and j for which x_(i, j) is needed depends highly on the m_i's. When approaching the problem with mmap, the need to read so many of these m_is seems to essentially load the entire file into memory. The problem is that it all seems to stay there, and I worry that I will have to pull my computation into IO to have more fine-grained control over the releasing of this memory.
Moreover, "the data structure" actually consists of a large number of these files parameterized by their file names. Together they amount to about a gigabyte.
An attempt at a more handwaving, but possibly easier to understand version of the question
Say I have some data on disk consisting of n^2 elements. A pure Haskell function needs on the order of n of the elements, but which of them depends in a complicated way on the values. I do not want to load the entire file into memory, because it is huge. One solution is to throw my function into the IO monad and read out elements as they are needed, but I call this "giving up". mmap lets us treat on-disk data as if it were in memory, essentially doing lazy IO with help from the OS' virtual memory system. This is nice, but since determining which elements of the data are needed requires accessing a lot of the file, mmap seems to keep way too much of the file in memory. In practice, I find that reading the data I need to determine the data I actually need loads the entire file into memory when using mmap.
What options do I have?
I would suggest that you write an interface that is entirely in IO, where you have an abstract type that contains both a Handle and information about the overall structure of your data (perhaps all the m_is if you can fit them), and this is complemented by IO operations that read out precise bits of the data by seeking in the handle.
I would then simply wrap this interface in a bunch of unsafePerformIO calls! This is effectively what mmap does behind the scenes, in a sense. You just are doing so in a more explicitly managed way.
Assuming you aren't worried about anyone "swapping out" the file behind your back, you can get an interface that you can reason about purely, while it actually does IO where necessary to give you the explicit control over memory that you need.
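The seek-based interface can be sketched as follows. The sketch is in Python rather than Haskell and only shows the IO layer (the handle plus the remembered m_i values), not the pure wrapper around it. It assumes the layout from the question, with little-endian 32-bit integers and doubles; the byte order and the class name JaggedFile are assumptions.

```python
import io
import struct
from typing import BinaryIO


class JaggedFile:
    """Hypothetical reader: keeps only row offsets and lengths in memory."""

    def __init__(self, handle: BinaryIO) -> None:
        self._h = handle
        self._h.seek(0)
        (n,) = struct.unpack("<i", self._h.read(4))
        self._rows = []  # (offset of first double, m_i) for each row
        for _ in range(n):
            (m,) = struct.unpack("<i", self._h.read(4))
            start = self._h.tell()
            self._rows.append((start, m))
            self._h.seek(start + 8 * m)  # skip the row's doubles unread

    def get(self, i: int, j: int) -> float:
        """Return x_(i, j), 1-based as in the question, reading 8 bytes."""
        if not 1 <= i <= len(self._rows):
            raise IndexError((i, j))
        start, m = self._rows[i - 1]
        if not 1 <= j <= m:
            raise IndexError((i, j))
        self._h.seek(start + 8 * (j - 1))
        (x,) = struct.unpack("<d", self._h.read(8))
        return x


# Tiny in-memory stand-in for a file: n = 2, rows [1.5, 2.5] and [9.0].
demo = io.BytesIO(
    struct.pack("<i", 2)
    + struct.pack("<i2d", 2, 1.5, 2.5)
    + struct.pack("<id", 1, 9.0)
)
reader = JaggedFile(demo)
print(reader.get(1, 2), reader.get(2, 1))
```

Only the n pairs of offsets and lengths stay resident; each element access costs one seek and one 8-byte read, so memory use is under the program's control rather than the pager's.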

Misuse of a variable's value?

I came across an instance where a solution to a particular problem was to use a variable whose value when zero or above meant the system would use that value in a calculation but when less than zero would indicate that the value should not be used at all.
My initial thought was that I didn't like the multipurpose use of the value of the variable: a.) as a range to be using in a formula; b.) as a form of control logic.
What is this kind of misuse of a variable called? Meta-'something' or is there a classic antipattern that this fits?
Sort of feels like when a database field is set to null to represent not using a value and if it's not null then use the value in that field.
Update:
An example would be that if a variable's value is > 0, I would use the value; if it's <= 0, I would not use the value and would perform some other logic instead.
Values such as these are often called "distinguished values". By far the most common distinguished value is null for reference types. A close second is the use of distinguished values to indicate unusual conditions (e.g. error return codes or search failures).
The problem with distinguished values is that all client code must be aware of the existence of such values and their associated semantics. In practical terms, this usually means that some kind of conditional logic must be wrapped around each call site that obtains such a value. It is far too easy to forget to add that logic, obtaining incorrect results. It also promotes copy-and-paste code as the boilerplate code required to deal with the distinguished values is often very similar throughout the application but difficult to encapsulate.
Common alternatives to the use of distinguished values are exceptions, or distinctly typed values that cannot be accidentally confused with one another (e.g. Maybe or Option types).
Having said all that, distinguished values may still play a valuable role in environments with extremely tight memory availability or other stringent performance constraints.
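As a sketch of the distinctly typed alternative mentioned above, here is how "no value" can be expressed with Python's Optional instead of a negative sentinel. The function and parameter names are invented for the example.

```python
from typing import Optional


def apply_discount(price: float, discount: Optional[float]) -> float:
    """None means "no discount"; any float, including 0.0, is a real value."""
    if discount is None:
        # The "do not use this value" case is a separate, explicit branch.
        return price
    return price - discount


print(apply_discount(100.0, None), apply_discount(100.0, 25.0))
```

The absence of a value is now visible in the signature, so a type checker can flag callers that forget the None case, and no part of the numeric range has to be reserved for control logic.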
I don't think what you're describing is a pure magic number, but it's kind of close. It's similar to the situation in pre-.NET 2.0 code, where you'd use Int32.MinValue to indicate a null value. .NET 2.0 introduced Nullable<T> and kind of alleviated this issue.
So you're describing the use of a variable whose value really means something other than its value: -1 means essentially the same as Int32.MinValue in the usage I described above.
I'd call it a magic number.
Hope this helps.
Using different ranges of the possible values of a variable to invoke different functionality was very common when RAM and disk space for data and program code were scarce. Nowadays, you would use a function or an additional, accompanying value (boolean, or enumeration) to determine the action to take.
Current OS's suggest 1GiB of RAM to operate correctly, when 256KiB was high very few years ago. Cheap disk space has gone from hundreds of MiB to multiples of TiB in a matter of months. Not too long ago I wrote programs for 640KiB of RAM and 10MiB of disk, and you would probably hate them.
I think it would be good to cope with code like that if it's just a few years old (refactor it!), and denounce it as bad practice if it's recent.

How is integer overflow exploitable?

Does anyone have a detailed explanation of how integer overflows can be exploited? I have been reading a lot about the concept, and I understand what it is, and I understand buffer overflows, but I don't understand how one could modify memory reliably, or in a way that modifies application flow, by making an integer larger than its defined memory.
It is definitely exploitable, but depends on the situation of course.
Old versions of ssh had an integer overflow which could be exploited remotely. The exploit caused the ssh daemon to create a hashtable of size zero and overwrite memory when it tried to store some values in there.
More details on the ssh integer overflow: http://www.kb.cert.org/vuls/id/945216
More details on integer overflow: http://projects.webappsec.org/w/page/13246946/Integer%20Overflows
I used APL/370 in the late 60s on an IBM 360/40. APL is a language in which essentially everything is a multidimensional array, and there are amazing operators for manipulating arrays, including reshaping from N dimensions to M dimensions, etc.
Unsurprisingly, an array of N dimensions had index bounds of 1..k with a different positive k for each axis, and k was legally always less than 2^31 (positive values in a 32-bit signed machine word). Now, an array of N dimensions has a location assigned in memory. An attempt to access an array slot using an index too large for an axis is checked against the array upper bound by APL. And of course this applied to an array of N dimensions where N == 1.
APL didn't check whether you did something incredibly stupid with the RHO (array reshape) operator. APL only allowed a maximum of 64 dimensions. So, you could make an array of 1 to 64 dimensions, and APL would do it if the array dimensions were all less than 2^31. Or, you could try to make an array of 65 dimensions. In this case, APL goofed, and surprisingly gave back a 64-dimension array, but failed to check the axis sizes.
(This is in effect where the "integer overflow" occurred.) This meant you could create an array with axis sizes of 2^31 or more... but being interpreted as signed integers, they were treated as negative numbers.
The right RHO operator incantation applied to such an array could reduce the dimensionality to 1, with an upper bound of, get this, "-1". Call this matrix a "wormhole" (you'll see why in a moment). Such a wormhole array has a place in memory, just like any other array. But all array accesses are checked against the upper bound... but the array bound check turned out to be done by an unsigned compare by APL. So, you can access WORMHOLE[1], WORMHOLE[2], ... WORMHOLE[2^32-2] without objection. In effect, you can access the entire machine's memory.
APL also had an array assignment operation, in which you could fill an array with a value.
WORMHOLE[]<-0 thus zeroed all of memory.
I only did this once, as it erased the memory containing my APL workspace, the APL interpreter, and obviously the critical part of APL that enabled timesharing (in those days it wasn't protected from users)... the terminal room went from its normal state of mechanically very noisy (we had 2741 Selectric APL terminals) to dead silent in about 2 seconds.
Through the glass into the computer room I could see the operator look up startled at the lights on the 370 as they all went out. Lots of running around ensued.
While it was funny at the time, I kept my mouth shut.
With some care, one could obviously have tampered with the OS in arbitrary ways.
It depends on how the variable is used. If you never make any security decisions based on integers you have added with input integers (where an adversary could provoke an overflow), then I can't think of how you would get in trouble (but this kind of stuff can be subtle).
Then again, I have seen plenty of code like this that doesn't validate user input (although this example is contrived):
int pricePerWidgetInCents = 3199;
int numberOfWidgetsToBuy = int.Parse(/* some user input string */);
int totalCostOfWidgetsSoldInCents = pricePerWidgetInCents * numberOfWidgetsToBuy; // KA-BOOM!
// potentially much later
int orderSubtotal = whatever + totalCostOfWidgetsSoldInCents;
Everything is hunky-dory until the day you sell 671,299 widgets for -$21,474,817.95. Boss might be upset.
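The arithmetic behind that figure can be checked with a short simulation. Python integers do not overflow, so the sketch wraps the product by hand to a signed 32-bit value, assuming two's-complement wraparound (which is what an unchecked 32-bit int multiplication produces). The helper to_int32 is a made-up name.

```python
def to_int32(value: int) -> int:
    """Wrap an unbounded Python int to a signed 32-bit two's-complement value."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


price_per_widget_in_cents = 3199
number_of_widgets_to_buy = 671_299

# The true product is 2,147,485,501, just above the int32 maximum of 2,147,483,647.
total_cost_in_cents = to_int32(price_per_widget_in_cents * number_of_widgets_to_buy)
print(total_cost_in_cents)
```

The wrapped result is -2,147,481,795 cents, i.e. the -$21,474,817.95 quoted above; with one widget fewer the product still fits and no overflow occurs.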
A common case would be code that guards against buffer overflow by asking for the number of inputs that will be provided, and then trying to enforce that limit. Consider a situation where I claim to be providing 2^30+10 integers. The receiving system computes the buffer size as 4*(2^30+10) bytes, which wraps around modulo 2^32 in 32-bit arithmetic, and so allocates just 40 bytes (!). Since the memory allocation succeeded, I'm allowed to continue. The input count check won't stop me when I send my 11th input, since 11 < 2^30+10. Yet I will overflow the buffer that was actually allocated.
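The size computation in this scenario can be reproduced in a few lines, assuming the size is held in an unsigned 32-bit variable and each input is a 4-byte integer. The function name alloc_size_u32 is hypothetical.

```python
def alloc_size_u32(count: int, elem_size: int = 4) -> int:
    """Byte size as a 32-bit unsigned multiplication would compute it."""
    return (count * elem_size) & 0xFFFFFFFF


claimed_count = 2**30 + 10
allocated_bytes = alloc_size_u32(claimed_count)  # 2^32 + 40 wraps to 40
real_capacity = allocated_bytes // 4             # room for only 10 integers

# The naive limit check compares against the claimed count, not the capacity.
eleventh_input_passes_check = 11 < claimed_count
print(allocated_bytes, real_capacity, eleventh_input_passes_check)
```

The check passes for the 11th input even though the buffer holds only 10, which is exactly the gap an attacker writes through.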
I just wanted to sum up everything I have found out about my original question.
The reason things were confusing to me was that I know how buffer overflows work and can understand how you can easily exploit them. An integer overflow is a different case: you can't directly exploit the integer overflow to add arbitrary code and force a change in the flow of an application.
However, it is possible to overflow an integer that is used, for example, to index an array, and so access arbitrary parts of memory. From there, it could be possible to use that mis-indexed array to overwrite memory and alter the execution of an application to suit your malicious intent.
Hope this helps.
