Formal UML representation of reshaping a data frame

To document the restructuring of a data table from a "wide" layout (one column per criterion score) to a "long" layout (one score column plus one criterion column), my first reaction was to use a UML class diagram.
I am aware that changing the structure of the data table does not change the class attributes.
My first question is whether the wide or the long version is the more correct representation of the data table?
My second question is whether it would make sense to relate the two representations - and if so, by which relationship?
My third question would be whether something other than a UML class diagram would be more suitable for documenting the reshaping (data preprocessing before showing the distribution as a box plot in R).

You jumped a little too fast from the table to the UML. This makes your question confusing, because what is wide as a table is represented long as a class, and vice versa.
Reformulating your problem, it appears that you are refactoring some tables. The wide table shows several values for the same student in the same row. This means that the maximum number of exercises is fixed by the table structure:
ID   Ex1  Ex2  Ex3  ...  ExN
-----------------------------
111  A    A    A    ...  A
119  A    C    -    ...  D
127  B    F    B    ...  F
The long table has fewer columns, and each row shows only one specific score of one specific student:
ID   #  Score
---------------
111  1  A
111  2  A
111  3  A
...
111  N  A
119  1  A
119  2  C
...
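For concreteness, the wide-to-long reshape itself is a one-liner in most data tools. Here is a minimal sketch in Python with pandas (the question mentions R, where tidyr::pivot_longer or reshape2::melt does the same job); the column names are hypothetical:

    import pandas as pd

    # Hypothetical wide table: one column per exercise
    wide = pd.DataFrame({
        "ID":  [111, 119, 127],
        "Ex1": ["A", "A", "B"],
        "Ex2": ["A", "C", "F"],
        "Ex3": ["A", None, "B"],
    })

    # Reshape to long: one row per (student, exercise) pair
    long = wide.melt(id_vars="ID", var_name="Exercise", value_name="Score")
    long = long.dropna(subset=["Score"]).sort_values(["ID", "Exercise"])
    print(long)
    # Each row now holds exactly one score of one student, which is the shape
    # plotting/grouping functions (e.g. a boxplot per exercise) usually expect.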
You can model this structure in a UML class diagram. But in UML, the table layout doesn't matter: that's an issue for the ORM mapping, and you could perfectly well have one class model (with an attribute or an association having a multiplicity of 1..N) that could be implemented using either the wide or the long version. If the multiplicity were 1..*, only the long option would work.
Now to your questions:
Both representations are correct; they just have different characteristics. The wide form is inflexible, since the maximum number of scores is fixed by the table structure. Also, adding a new score in fact requires updating a record (so the concurrency behavior of the two models is not the same). The long form is a little more complex to use if you want to show the history of a student's scores in a single row.
Yes, it makes sense to relate both, especially if you're documenting a transformation of the first into the second.
UML would not necessarily add value here. If you're really concerned with tables and values, you could just as well use an Entity/Relationship diagram. But UML has the advantage of allowing database modelling as well, and it lets you add behavioral aspects, if not now then later. You could consider using the non-standard «table» stereotype to clarify that you are modelling a table (i.e., a low-level view of your design).

Related

Converting NLP to CSP: Story Consistency

Background: I would like to know if anyone has succeeded in converting natural language to a knowledge base representing a constraint satisfaction problem. I want to perform constraint satisfaction on a person's statements in order to see if any inconsistencies are present while performing a resolution proof on the statements. This could be used in a courtroom or during election debates.
So to lay out my idealistic story consistency algorithm:
A first statement comes in; convert it and add it to the Knowledge Base (KB)
While there is a next statement:
    get the next statement
    convert the statement to a clause
    negate the clause
    add the negated clause to the KB
    check for a contradiction (perform resolution)
    report the finding
    remove the original clause to see if the story changes again
    add the new clause
How would I convert a statement to a usable clause?
For example:
~A B C
A ~B C
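For what it's worth, here is a minimal sketch in Python of the resolution step of the loop above for plain propositional clauses, assuming the statements have already been converted to clause sets by hand (the natural-language-to-clause conversion is exactly the open part of the question); all names are hypothetical:

    from itertools import combinations

    def resolve(c1, c2):
        """Return all resolvents of two clauses (sets of literals such as 'A' or '~A')."""
        resolvents = []
        for lit in c1:
            neg = lit[1:] if lit.startswith("~") else "~" + lit
            if neg in c2:
                resolvents.append((c1 - {lit}) | (c2 - {neg}))
        return resolvents

    def inconsistent(kb):
        """Saturate the KB with resolution; deriving the empty clause means a contradiction."""
        clauses = {frozenset(c) for c in kb}
        while True:
            new = set()
            for c1, c2 in combinations(clauses, 2):
                for r in resolve(c1, c2):
                    if not r:
                        return True            # empty clause derived
                    new.add(frozenset(r))
            if new <= clauses:
                return False                   # nothing new: no contradiction found
            clauses |= new

    kb = [{"~A", "B", "C"}, {"A", "~B", "C"}]          # the two example clauses above
    print(inconsistent(kb))                            # False: they are mutually satisfiable
    print(inconsistent(kb + [{"~C"}, {"A"}, {"~B"}]))  # True once contradicting statements arrive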

UML Circular reference with both aggregation and composition

A few days ago a friend pointed out to me that I had a wrong idea of composition in UML. She was completely right, so I decided to find out what else I might have been wrong about. Right now, there is one more thing I have doubts about: I have a circular dependency in my codebase that I would like to present in UML form. But how?
In my case the following is true:
Both A and B have a list of C
C has a reference to both A and B to get information from.
C cannot exist if either A or B ceases to exist
Both A and B continue to exist after C is deleted from A and/or B
To model this, I've come up with the following UML (I've omitted multiplicities for now, to not crowd the diagram).
My question is, is this the right way to model such relations?
Problems
Some facts to keep in mind:
Default multiplicity makes your model invalid. A class may only be composed in one other class. When you don't specify multiplicity, you get [1..1]. That default is sad, but true.
The UML spec doesn't define what open-diamond aggregation means.
Your model has many duplicate properties. There is no need for any of the properties in the attribute compartments, as there are already unnamed properties at the ends of every association.
Corrections
Here is a reworking of your model to make it more correct:
Notice the following:
The exclusive-or constraint between the associations means only one of them can exist at a time.
Unfortunately, the multiplicities allow an instance of C to exist without being composed by A or B. (See the reworked model below.)
The property names at the ends of all associations explicitly name what was unnamed in your model. (I also attempted to indicate purpose in the property names.)
The navigability arrows prevent multiple unwanted properties without resorting to duplicative attributes.
Suggested Design
If I correctly understand what your model means, here is how I would probably reverse the implementation into design:
Notice the following:
Class D is abstract (the class name is in italics), meaning it can have no direct instances.
The generalization set says:
An instance cannot be multiply classified by A and B. (I.e., A and B are {disjoint}.)
An instance of D must be an instance of one of the subclasses. (I.e., A and B are {complete}, which is known as a covering axiom.)
The subclasses inherit the ownedC property from class D.
The composing class can now have a multiplicity of [1..1], which no longer allows an instance of C to exist without being composed by an A or a B.
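If it helps to read the suggested design as code, here is a minimal sketch of one possible interpretation in Python (the class names D, A, B, and C come from the question; everything else is a hypothetical illustration, not a prescribed implementation):

    from abc import ABC, abstractmethod

    class D(ABC):
        """Abstract composite (italicized in the diagram)."""
        def __init__(self):
            self.owned_c = []            # the ownedC property inherited by A and B

        @abstractmethod
        def kind(self) -> str:           # hypothetical abstract member so D cannot be instantiated
            ...

        def add_c(self):
            c = C(owner=self)            # a C is always created inside exactly one composite
            self.owned_c.append(c)
            return c

        def remove_c(self, c):
            self.owned_c.remove(c)       # composition: removing the part goes through its owner

    class A(D):                          # {disjoint, complete}: every D is an A or a B, never both
        def kind(self) -> str:
            return "A"

    class B(D):
        def kind(self) -> str:
            return "B"

    class C:
        def __init__(self, owner: D):
            self.owner = owner           # multiplicity [1..1]: every C is composed by one A or B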
Leave out the open diamonds and make them normal associations. These are not shared aggregations but simple associations. The composite aggregations are fine.
In general there is not much added value in showing aggregations at all; the semantic added value is very low. In the past this was a good hint to help garbage collection deal with unneeded objects, but nowadays almost all target languages have efficient built-in garbage collectors. Only where you want explicit deletion of the aggregated objects should you use composite aggregation.

ID3 Implementation Clarification

I am trying to implement the ID3 algorithm, and am looking at the pseudo-code:
(Source)
I am confused by the bit where it says:
If Examples_vi is empty, create a leaf node with label = most common value of TargetAttribute in Examples.
Unless I am missing out on something, shouldn't this be the most common class?
That is, if we cannot split the data on an attribute value because no sample takes that value for the particular attribute, then we take the most common class among all samples and use that?
Also, isn't this just as good as picking a random class?
The training set tells us nothing about the relation between the attribute value and the class labels...
1) Unless I am missing out on something, shouldn't this be the most common class?
You're correct, and the text also says the same. Look at the function description at the top:
Target_Attribute is the attribute whose value is to be predicted by the tree
so the value of Target_Attribute is the class/label.
2) That is, if we cannot split the data on an attribute value because no sample takes that value for the particular attribute, then we take the most common class among all samples and use that?
Yes, but not among all samples in your whole dataset; rather, among those samples that reached this point in the tree/recursion. (ID3 is recursive, so the current Examples is actually the Examples_vi of the caller.)
3) Also, isn't this just as good as picking a random class?
The training set tells us nothing about the relation between the attribute value and the class labels...
No, picking a random class (with equal chances for each class) is not the same, because the inputs often have an unbalanced class distribution (this distribution is often called the prior distribution in many texts): you may have 99% positive examples and only 1% negative. So whenever you really have no information whatsoever to decide on the outcome of some input, it makes sense to predict the most probable class, so that you have the highest probability of being correct. This maximizes your classifier's accuracy on unseen data only under the assumption that the class distribution in your training data is the same as in the unseen data.
This explanation holds, with the same reasoning, for the base case when Attributes is empty (see the fourth line of your pseudocode); whenever we have no information, we just report the most common class of the data at hand.
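As a minimal, self-contained sketch of how those base cases typically look in code (an illustration in Python, not Mitchell's exact pseudocode; the hypothetical domains argument maps each attribute to all of its possible values, so that an empty Examples_vi can actually occur):

    from collections import Counter
    from math import log2

    def majority(examples, target):
        """Most common class among the examples that reached this node."""
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]

    def entropy(examples, target):
        n = len(examples)
        return -sum(c / n * log2(c / n) for c in Counter(ex[target] for ex in examples).values())

    def info_gain(examples, attr, target):
        n = len(examples)
        remainder = 0.0
        for v in {ex[attr] for ex in examples}:
            subset = [ex for ex in examples if ex[attr] == v]
            remainder += len(subset) / n * entropy(subset, target)
        return entropy(examples, target) - remainder

    def id3(examples, attributes, domains, target):
        labels = {ex[target] for ex in examples}
        if len(labels) == 1:
            return labels.pop()                       # pure node -> leaf
        if not attributes:
            return majority(examples, target)         # no attributes left -> majority class
        best = max(attributes, key=lambda a: info_gain(examples, a, target))
        children = {}
        for v in domains[best]:                       # iterate over *all* possible values of best
            subset = [ex for ex in examples if ex[best] == v]
            if not subset:
                # Examples_vi is empty: label with the majority class of the examples
                # at *this* node (the caller's Examples_vi), not of the whole dataset.
                children[v] = majority(examples, target)
            else:
                children[v] = id3(subset, [a for a in attributes if a != best],
                                  domains, target)
        return (best, children)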
If you have never implemented ID3 yourself but still want to know more about the processing details, I suggest you read this paper:
Building Decision Trees in Python
and here is the source code from the paper:
decision tree source code
The paper includes a worked example (you can also use an example from your book; just replace the "data" file with one in the same format), and you can debug it (with some breakpoints) in Eclipse to check the attribute values while the algorithm runs.
Go over it and you will understand ID3 better.

Precise semantic of annotated associations with { bag } and multiplicity bound constraints

Suppose I have A ---r1 {bag} [1..2]--> B in a UML class diagram (that is, r1 is an association from A to B, annotated with {bag} and with multiplicity [1..2]).
My Question: if a:A is an instance of A, is the following collection valid?
a.r1={(b1,1),(b1,2),(b2,1)} //collection contains two copies of b1 and one b2
In other words, do the multiplicity bounds (i.e., [1..2]) apply to the association when it is interpreted purely as r1: A --> B, or do they apply to r1: A --> Bag(B)? In the former interpretation the above collection is valid, since r1 contains at most two instances of B, but in the latter it is not, since r1 contains three elements of Bag(B)! Which interpretation is correct?
Multiplicity constraints in UML are explained in Chapter 7.5.3 of the UML specification, to which I was referred in this question.
p.s.1: A similar question arises when we substitute {bag} with {seq}.
p.s.2: I added the haskell tag to get comments from the large Haskell community here, as #xmojmr suggested. Thanks to #peter, who nicely drew the pictures in his answer.
As stated in the spec, a Bag is an unordered, non-unique collection.
However, this describes the relation between the elements you are pointing to.
So your example can be expressed either way:
This means that A has a reference to one or two B instances, and those references are stored in a Bag (or any non-unique, unsorted collection; that is an implementation detail).
To answer your question: no, the collection is not valid, because the Bag contains three instances of B, while the allowed maximum is two.
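To put that reading in concrete terms, a tiny sketch in Python (the names are hypothetical):

    from collections import Counter

    UPPER_BOUND = 2                        # the [1..2] multiplicity from the question

    # a.r1 = {(b1, 1), (b1, 2), (b2, 1)}: two copies of b1 and one copy of b2
    r1 = Counter({"b1": 2, "b2": 1})

    total_links = sum(r1.values())         # 3: the multiplicity counts every element, duplicates included
    print(total_links <= UPPER_BOUND)      # False -> the collection violates [1..2]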

Matching Based on Arbitrary Categories and Similarity Measures

I have a customer database where each customer has certain attributes and a customer type. The collection of attributes can vary (though they come from a finite set), and when I look at a new customer of unknown type with given attributes, I would like to determine which type s/he belongs to. For example, say I have these customers already in the DB:
Customer | Type | Attributes
---------+------+-----------------
    1    |  A   | 44, 32, 5, 'X'
    2    |  A   | 3, 32, 66, 'A'
    3    |  B   | 6, 32, 'A', 'B'
    4    |  C   | 47, 31, 2, 'H'
    5    |  C   | 14, 32, 2, 'O'
    6    |  C   | 2, 'C'
    7    |  A   | 44
When I receive a new customer who has attributes, for example, 3, 32, 2, I would like to determine which type this customer belongs to, and the code should report its confidence (as a percentage) in this match.
What is the best method to use here? Something statistical, a method based on an affinity matrix of some kind, or a recommendation-engine-style approach based on Pearson correlation coefficients? Sample or pseudocode would be most welcome, but any and all ideas are fine.
Thanks,
The way to solve this problem is to use Naive Bayes.
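Since the answer only names the technique, here is a minimal sketch in Python of a categorical Naive Bayes classifier over attribute sets like the ones in the question, with add-one smoothing and normalized scores as the confidence percentages (an illustration under those assumptions, not the only way to set it up):

    from collections import Counter, defaultdict

    training = [
        ("A", {44, 32, 5, "X"}),
        ("A", {3, 32, 66, "A"}),
        ("B", {6, 32, "A", "B"}),
        ("C", {47, 31, 2, "H"}),
        ("C", {14, 32, 2, "O"}),
        ("C", {2, "C"}),
        ("A", {44}),
    ]

    def train(data):
        type_counts = Counter(t for t, _ in data)
        attr_counts = defaultdict(Counter)           # attr_counts[type][attribute] = frequency
        vocabulary = set()
        for t, attrs in data:
            attr_counts[t].update(attrs)
            vocabulary |= attrs
        return type_counts, attr_counts, vocabulary

    def classify(attrs, type_counts, attr_counts, vocabulary):
        total = sum(type_counts.values())
        scores = {}
        for t, n in type_counts.items():
            score = n / total                        # prior P(type)
            denom = sum(attr_counts[t].values()) + len(vocabulary)
            for a in attrs:
                # add-one (Laplace) smoothing so unseen attributes don't zero the product
                score *= (attr_counts[t][a] + 1) / denom
            scores[t] = score
        z = sum(scores.values())
        return {t: s / z for t, s in scores.items()} # normalized -> confidence per type

    model = train(training)
    print(classify({3, 32, 2}, *model))              # the largest value is the predicted type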
