I have a DecisionTree class for creating a machine learning model. DecisionTree has many fields representing settings. Each field has a default value, and in most cases only one or two of them need to be changed.
The issue is that the DecisionTree is computationally expensive to build. So instead of constructing the model in the constructor, I have the constructor only parse and save the data; the model is not built until DecisionTree.build is called. This allows the settings to be changed before the expensive build happens. However, it also means DecisionTree.predict will fail if it is called before build.
I know it's good practice to keep objects in a valid state at all times. But that means building the tree in the constructor, which is expensive, and then rebuilding it whenever any setting is changed.
Example 1: Build call is separate
DecisionTree tree = new DecisionTree(data, classes, attributes);
tree.predict(item); //This would error
tree.maxDepth = 15;
tree.infoGain = 0.5;
tree.build();
tree.predict(item); // Now it would work
Example 2: Build call included, settings not in constructor
DecisionTree tree = new DecisionTree(data, classes, attributes); // This would take a long time to complete
tree.predict(item); //This would now work
tree.maxDepth = 15;
tree.infoGain = 0.5;
tree.build(); // This would once again take a long time to complete
tree.predict(item); // Done, but this example took twice as long as the previous one
Example 3: Settings all included in the constructor
DecisionTree tree = new DecisionTree(data, classes, attributes, null, null, 15, null, null, 0.5, null, null, null); // Settings are all included in constructor
tree.predict(item); //This would immediately be callable
My question is: are these three options the only ways to deal with many settings? What is the standard/best practice for this?
I don't think it's bad practice to fit the algorithm with a separate method. Look at scikit-learn, for example: it provides a separate method to fit the object, the constructor itself just initializes internal variables, and if you call predict before fit it throws a NotFittedError. Besides that, maybe in the future you will want to extend your algorithm to work with, for example, minibatches; in that case it is impossible to call the constructor more than once, so you will need something like a partial_fit method to fit the classifier on an additional chunk of data. So you cannot do everything in the constructor.
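As a minimal sketch of that style in Python (the helpers _build_tree and _classify are hypothetical placeholders, and a plain RuntimeError stands in for scikit-learn's NotFittedError):

class DecisionTree:
    def __init__(self, max_depth=10, info_gain=0.1):
        # Cheap: only store the settings; nothing expensive happens here.
        self.max_depth = max_depth
        self.info_gain = info_gain
        self._root = None  # set by fit()

    def fit(self, data, classes, attributes):
        # Expensive: the tree is actually built here, on demand.
        self._root = self._build_tree(data, classes, attributes)
        return self  # returning self allows tree.fit(...).predict(...)

    def predict(self, item):
        if self._root is None:
            raise RuntimeError(
                "This DecisionTree is not fitted yet; call fit() first.")
        return self._classify(self._root, item)

    def _build_tree(self, data, classes, attributes):
        ...  # placeholder for the expensive construction

    def _classify(self, root, item):
        ...  # placeholder for tree traversal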
If you have a large number of parameters in initialization, you may find the Builder pattern useful.
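As a sketch of that pattern, reusing the hypothetical DecisionTree above (in Python, keyword arguments with defaults already give much of the same benefit, so treat this purely as an illustration of the structure):

class DecisionTreeBuilder:
    def __init__(self, data, classes, attributes):
        self._data = data
        self._classes = classes
        self._attributes = attributes
        self._settings = {}  # only the settings that differ from defaults

    def max_depth(self, value):
        self._settings["max_depth"] = value
        return self  # returning self enables chaining

    def info_gain(self, value):
        self._settings["info_gain"] = value
        return self

    def build(self):
        # The one expensive construction happens here, exactly once.
        return DecisionTree(**self._settings).fit(
            self._data, self._classes, self._attributes)

# Usage: only the two non-default settings appear, and there is no
# window in which an unbuilt tree can be asked to predict.
# tree = (DecisionTreeBuilder(data, classes, attributes)
#         .max_depth(15)
#         .info_gain(0.5)
#         .build())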
The principle of an always-valid domain model dictates that value objects and entities should be self-validating, so that they can never be in an invalid state.
This sometimes requires creating some kind of wrapper for primitive values. However, it seems to me that this might break the ubiquitous language.
For instance, say I have two entities: Hotel and House. Each has images associated with it, which must respect the following rules:
Hotels must have at least 5 images and no more than 20
Houses must have at least 1 image and no more than 10
To me this entails the following classes:
class House {
  HouseImages images;
  // ...
}

class Hotel {
  HotelImages images;
}

class HouseImages {
  final List<Image> images;

  HouseImages(this.images)
      : assert(images.length >= 1),
        assert(images.length <= 10);
}

class HotelImages {
  final List<Image> images;

  HotelImages(this.images)
      : assert(images.length >= 5),
        assert(images.length <= 20);
}
Doesn't that break the ubiquitous language a bit? It just feels off to have all those classes that are essentially prefixed (HotelName vs HouseName, HotelImages vs HouseImages, and so on). In other words, my value object folder that once consisted of x, y, z, where x, y, and z were also documented in a lexicon document, now has house_x, hotel_x, house_y, hotel_y, house_z, hotel_z, and it doesn't read as naturally as it did when it was just x, y, z.
Is this common, or is there something I have misunderstood here? I do like the assurance it gives, and it has actually caught some bugs too.
There is some reasoning you can apply that usually helps me when deciding whether to introduce a value object. There are two very good blog articles on this topic that I would like to recommend:
https://enterprisecraftsmanship.com/posts/value-objects-when-to-create-one/
https://enterprisecraftsmanship.com/posts/collections-primitive-obsession/
I would like to address your concrete example based on the heuristics taken from the mentioned articles:
Is there more than one primitive value that encapsulates a concept, i.e., things that always belong together?
For instance, a Coordinate value object would contain Latitude and Longitude; it would not make sense to have different parts of your application knowing that these need to be instantiated and validated together as a whole. A Money value object with an amount and a currency identifier would be another example. On the other hand, I would usually not have a separate value object for the amount field, as the Money object already takes care of making sure it is a reasonable value (e.g. a positive value).
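A minimal sketch of such a value object in Python (the three-letter currency check is an illustrative assumption):

from dataclasses import dataclass

@dataclass(frozen=True)
class Money:
    amount: float
    currency: str

    def __post_init__(self):
        # The object validates itself as a whole, so no other part of
        # the application ever sees an invalid amount/currency pair.
        if self.amount < 0:
            raise ValueError("amount must not be negative")
        if len(self.currency) != 3:
            raise ValueError("currency must be a 3-letter code")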
Is there complexity and logic (like validation) that is worth being hidden behind a value object?
For instance, your HotelImages value object, which defines a specific collection type, caught my attention. If HotelImages is not used in different spots and the logic is as simple as in your sample, I would not add such a collection type but rather do the validation inside the Hotel entity. Otherwise you would blow up your application with custom value objects for basically everything.
On the other hand, if there were some concept like an image collection that has its own meaning in the business domain and a set of business rules, and if that type were used in different places (for instance, an ImageCollection value object used by both Hotel and House), it could make sense to have such a value object.
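As a sketch of that idea (shown in Python rather than Dart; passing the bounds in from each entity is one assumed way to express the shared rules):

class ImageCollection:
    # A shared value object; each owning entity passes in its own bounds.
    def __init__(self, images, min_count, max_count):
        if not (min_count <= len(images) <= max_count):
            raise ValueError("expected between %d and %d images, got %d"
                             % (min_count, max_count, len(images)))
        self.images = tuple(images)  # immutable snapshot

class Hotel:
    def __init__(self, images):
        self.images = ImageCollection(images, min_count=5, max_count=20)

class House:
    def __init__(self, images):
        self.images = ImageCollection(images, min_count=1, max_count=10)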
I would apply the same thinking to your question about HouseName and HotelName. If these have no special meaning or complexity outside of the Hotel and House entities, but are just simple properties of those entities, then in my opinion having value objects for them would be overkill. If there were something like a BuildingName with a set of rules the name has to follow, or if it even consisted of several primitive values, then it would again make sense to use a value object.
This relates to the third point:
Is there actual behaviour duplication that could be avoided with a value object?
Following from the last point, extracting things into a custom value object to avoid actual duplication (behaviour duplication, not just code duplication) can also make sense. But in this case you always have to be careful not to fall into the trap of incidental duplication.
Does your overall project complexity justify the additional work?
This needs to be answered on your side, of course, but I think it's good to always consider whether the benefits outweigh the costs. If you have a simple CRUD-like application that is not expected to change a lot and will not be long-lived, all the mentioned heuristics have to be applied with that overall project complexity in mind.
I have been working on a very dense set of calculations, all in support of a specific problem I have.
But the nature of the problem is no different from this: suppose I develop a class called 'Matrix' that has the machinery to implement matrices. Instantiation would presumably take a list of lists, which would be the matrix entries.
Now I want to provide a multiply method. I have two choices. First, I could define a method like so:
class Matrix:
    def __init__(self, entries):
        # do the obvious here
        return

    def determinant(self):
        # again, do the obvious here
        return result_of_calcs

    def multiply(self, b):
        # again, do the obvious here
        return
If I do this, the call signature for two matrix objects, a and b, is
a.multiply(b)...
The other choice is a @staticmethod. Then, the definition looks like:
@staticmethod
def multiply(a, b):
    # do the obvious thing.
    return
Now the call signature is:
z = Matrix.multiply(a, b)
I am unclear on when one is better than the other. The free-standing function is not truly part of the class definition, but who cares? It gets the job done, and because Python allows "reaching into an object" from outside, it seems able to do everything. In practice the class and the function will end up in the same module, so they're at least linked there.
On the other hand, my understanding of the @staticmethod approach is that the function is now part of the class definition (it defines one of the methods), but the method gets no "self" passed in. In a way this is nice because the call signature is the better-looking:
z = Matrix.multiply(a, b)
and the function can still access the passed-in instances' methods and attributes.
Is this the right way to view it? Are there strong reasons to do one or the other? In what ways are they not equivalent?
I have done quite a bit of Python programming since asking this question.
Suppose we have a file named matrix.py, and it has a bunch of code for manipulating matrices. We want to provide a matrix multiply method.
The two approaches are:
1. define a free-standing function with the signature multiply(x, y)
2. make it a method on all matrices: x.multiply(y)
Matrix multiply is what I will call a dyadic function. In other words, it always takes two arguments.
The temptation is to use #2, so that a matrix object "carries with it everywhere" the ability to be multiplied. However, the only thing it makes sense to multiply it by is another matrix object. In such cases there are two equally good ways to do that, viz:
z=x.multiply(y)
or
z=y.multiply(x)
However, a better way is to define a free-standing function inside the file:
multiply(x,y)
multiply(), as such, is a function that any code using the 'library' expects to have available. It need not be attached to each matrix, and since the user will be doing an import, they will get the multiply function anyway. This is better code.
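A sketch of that layout (the list-of-rows representation of entries is carried over from the question):

# matrix.py
class Matrix:
    def __init__(self, entries):
        self.entries = entries  # list of rows

def multiply(x, y):
    # Module-level dyadic function: neither operand "owns" the operation.
    return Matrix([[sum(a * b for a, b in zip(row, col))
                    for col in zip(*y.entries)]
                   for row in x.entries])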
What I was wrongly conflating were two kinds of function, and that conflation led me to attach the method to every object instance:
1. functions which are needed inside the file and should also be exposed outside it; and
2. functions which are needed only inside the file.
multiply() is an example of type 1: any matrix 'library' ought to define matrix multiplication.
What I was worried about was exposing all the 'internal' functions. For example, suppose we want to make add(), multiply(), and invert() externally available, but we do not want to expose determinant(), even though we need it internally.
One way to 'protect' users is to make determinant a function (a def statement) inside the class declaration for matrices. Then it is somewhat shielded from exposure. However, nothing stops a user of the code from reaching in, if they know the internals, by calling the method matrix.determinant().
In the end it comes down to convention, largely. It makes more sense to expose a matrix multiply function that takes two matrices and is called as multiply(x, y). As for the determinant function, instead of 'wrapping it' in the class, it makes more sense to define it as _determinant(x) at the same level as the class definition for matrices.
You can never truly protect internal functions by their declaration, it seems. The best you can do is warn users: the leading-underscore convention signals 'this is not expected to be called outside the code in this file'.
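Continuing the sketch above, the internal helper would then look something like this (the name is illustrative):

# matrix.py (continued)
def _determinant(x):
    # Leading underscore: internal to this module by convention. Nothing
    # enforces it, but "from matrix import *" will skip the name.
    ...  # placeholder for the actual calculation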
I am creating a simple model of a network. The network contains nodes. Nodes send and receive data.
Here's one way to model the network: Each node has a "data" field representing the data possessed by the node at time t. Each node also has a "send" field recording the data sent to other nodes at time t.
sig Node {
    data: Data -> Time,
    send: Data -> Node -> Time
}
On the spectrum between 100% declarative and 100% imperative, I don't think that signature is at the fully declarative end. In my first paragraph I said nothing about nodes having data stores, and nothing about keeping a record of sent data.
Also, isn't "send" a verb? Isn't that a sign that the model is not as declarative as it could be? Shouldn't declarative models exclusively use nouns?
I want my model to be at the 100% declarative end of the spectrum. To achieve that, I must simply state what "is". Let's do it!
What is, is that there are nodes:
sig Node {}
What is, is that at any time t, a node has data:
sig Node {
    data: Data -> Time
}
What is, is that the network has wires between nodes:
sig Network {
    wire: Node -> Node
}
What is, is that data d is on a wire between times t and t' ...
Let me stop there. Is the second approach that I sketched out more declarative? Is there an approach that is even more declarative?
How to model a network in a 100% declarative manner?
I think you need to separate the semantic content of the model, which is one issue, from the names used in it.

To me, the essence of a declarative model is that it records observations -- which may be about state, or about dynamic things like transitions -- expressed in logic. The alternative, an operational model, describes behavior and states by building up sequences of primitive actions.

A simple example: modeling an action that picks an element non-deterministically from a set. An operational spec might treat the set as ordered, then use a loop to walk through the set, at each step tossing a coin, and returning the given element when the toss first comes up heads. This models the intuitive operational description: "go through the elements of the set one at a time, and pick one to return". A declarative spec would simply say that the returned value is an element of the set -- nothing else needs to be said.
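To make the contrast concrete, here is a rough sketch in Python (illustrative only, since the answer is about specifications rather than code):

import random

def pick_operational(s):
    # Operational: impose an order, walk it, toss a coin at each element.
    items = list(s)
    while True:
        for item in items:
            if random.random() < 0.5:  # the coin toss
                return item

def satisfies_declarative_spec(s, result):
    # Declarative: the only observation stated is membership.
    return result in s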
Regarding the use of names in your particular model, the larger issue to me is whether the names directly convey what they mean. send is not a very helpful name to me; a better one would be sentAt.
Is there a faster way to search data in JavaScript (specifically on V8 via node.js, but without C/C++ modules) than using a JavaScript object?
This may be outdated, but it suggests a new class is dynamically generated for every single property, which made me wonder whether a binary tree implementation might be faster; however, this does not appear to be the case.
The binary tree implementation isn't well balanced, so it might do better with balancing (only the first 26 values are roughly balanced, by hand).
Does anyone have an idea why, or how it might be improved? On another note: does the dynamic-class notion mean there are actually ~260,000 properties (in the jsperf benchmark test of the second link), and consequently chains of dynamic class definitions held in memory?
V8 uses the concept of 'maps', which describe the layout of the data in an object.
These maps can be "fast maps", which specify a fixed offset from the start of the object at which a particular property can be found, or "dictionary maps", which use a hashtable to provide a lookup mechanism.
Each object has a pointer to the map that describes it.
Generally, objects start off with a fast map. When a property is added to an object with a fast map, the map is transitioned to a new one which describes the location of the new property within the object. The object is re-allocated with enough space for the new data item if necessary, and the object's map pointer is set to the new map.
The old map keeps a record of the transitions from it, including a pointer to the new map and a description of the property whose addition caused the map transition.
If another object which has the old map gets the same property added (which is very common, since objects of the same type tend to get used in the same way), that object will just use the new map - V8 doesn't create a new map in this case.
However, once the number of properties goes over a certain threshold (in fact, the current metric has to do with the storage space used, not the actual number of properties), the object is changed to use a dictionary map. At this point the object is re-written using a hashtable. In general, it won't undergo any more map transitions - any further properties that are added will just go in the hashtable.
Fast maps allow V8 to generate optimized code (using Crankshaft) where the offset of a property within an object is hard-coded into the machine code. This makes it very fast for cases where it can do this - it avoids the need for doing any lookup.
Obviously, the generated machine code is then dependent on the map - if the object's data layout changes, the code has to be discarded and re-optimized when necessary. V8 has a type profiling mechanism which collects information about what the types of various objects are during execution of unoptimized code. It doesn't trigger optimization of the code until certain stability constraints are met - one of these is that the maps of objects used in the function aren't changing frequently.
Here's a more detailed description of how this stuff works.
Here's a video where one of the lead developers of V8 describes stuff like map transitions and lots more.
For your particular test case, I would think that it goes through a few hundred map transitions while properties are being added in the preparation loop, then it will eventually transition to a dictionary based object. It certainly won't go through 260,000 of them.
Regarding your question about binary trees: a properly sized hashtable (with a sensible hash function and a significant number of objects in it) will always outperform a binary tree for a use-case where you're just searching, as your test code seems to do (all of the insertion is done in the setup phase).
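As a rough cross-language illustration of that claim (Python rather than JavaScript, with arbitrary sizes), comparing hash lookup against binary search over the same keys:

import random
import timeit
from bisect import bisect_left

keys = ["key%d" % i for i in range(260000)]
table = {k: i for i, k in enumerate(keys)}   # the hashtable
sorted_keys = sorted(keys)                   # stands in for a balanced tree
probes = random.sample(keys, 1000)

def hash_lookup():
    for k in probes:
        table[k]  # expected O(1) per probe

def binary_search_lookup():
    # Binary search over a sorted array costs O(log n) comparisons per
    # probe, the same as a balanced binary search tree.
    for k in probes:
        i = bisect_left(sorted_keys, k)
        assert sorted_keys[i] == k

print("hash:", timeit.timeit(hash_lookup, number=100))
print("tree-like:", timeit.timeit(binary_search_lookup, number=100))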
I would like to know a bit more about the bounding this error relates to.
I have an Alloy model for which I create an instance manually (by writing it down in XML).
This instance is readable and the A4Solution can be correctly displayed.
But when I try to evaluate an expression in this instance using the eval() function, I receive this error message, even though the field name and type of the ExprVar retrieved from the model are exactly the same as those in the instance.
I would like to know what this bounding consists of. What properties are taken into consideration to decide that one element of the instance is bound to one element of the model?
Is the hidden ID that appears somewhere in the XML taken into consideration?
I would like to know what this bounding consists of.
It means that every free variable (i.e., relation in Kodkod) has to be given a bound before the solver is asked to solve the formula.
What properties are taken into consideration to decide that one element of the instance is bound to one element of the model?
The instance XML file contains an exact value (a tuple set) for each and every sig and field from the model; those tuple sets are exactly the bounds that should be used when evaluating expressions. I'm not sure how exactly you are trying to use the Alloy API, but recreating the bounds from the XML file should (mostly) be handled by the API behind the scenes.
Judging by your last comment to my previous answer, your problem might be related to this post.
In a nutshell, if you are trying to evaluate AST expression objects obtained from one "alloy world" (CompModule) against a solution that was entirely recreated from an XML file, you will get exactly that error. For example, if you have two PrimSig objects, s1 and s2, and they both have the same name (e.g., both were obtained by parsing sig S {...}), they are not "Java equals" (s1.equals(s2) returns false); the same holds for the underlying Kodkod relations/variables. So if you then try to evaluate s1 in a context which has a bound only for s2, you'll get an error saying that s1 is not bound.
Providing the set of signatures defined in the metamodel when reading the A4Solution solved my problem.
Thus changing:
solution = A4SolutionReader.read(null, new XMLNode(f));
to
solution = A4SolutionReader.read(mm.getAllReachableSigs(), new XMLNode(f));
solved this illegal bounding issue.