Extract most important features (per Class) using mutual_info_classif - scikit-learn

I'm using mutual_info_classif to determine the most important words for a binary text-classification task as:
mi_score = mutual_info_classif(X, y)
but the above gives an array of feature scores without reference to the corresponding classes
Is there a way to get the most important features per class using MI?
P.s., I've already tried Chi2 but it gives the same feature rank for both classes

Mutual information is a measure of dependence between 2 variables. In your case, between each of the attribute variables and the "Class" variable. The mutual information will give a higher score, when the attribute variable creates a better split of the target variable. This means you only get one score that describes the strength between the attribute and the class. The most important feature is the one that best distinguishes between all of the classes.
If you have a class with multiple labels (not a binary class), you can create a new class variable for each label by using dummy variables.
For example, assume your class name is CLASS, and it holds 3 different labels: "Red", "Green" and "Blue". Create 3 new target variables, the first will be called "Is_Red", and it will hold "Yes" if CLASS=="Red" or "No" Otherwise. In this manner you can see which attribute best distinguish between each specific instance of the class. You will have to run the mutual information per new class variable.

Related

How can measurements of health properties be modeled in a class diagram?

I have a User class that can "measure" some parameters associated to a date and input them in an application. So 1 User -> many parameters of many types associated to many dates (many measurements). The parameters types are fixed and can be both numeric or strings, e.g: weight, height, calories intake, some strings... which are represented as an enumeration.
Now my main problem is: does the fact that the parameters can be of different datatypes (numbers or strings) mean that the general parameter type has to have specialisations for the two subgroups of parameters? Or is the datatype for each type of parameter implied in the type itself? (e.g. a "weight" implies it should be a number)
How can the "Parameter" class be represented in a correct way considering that:
it can be both numerical or a string
there is also a superuser class that can add parameters for a specific user
the parameters the superuser can input are some of the ones the normal user can PLUS some other parameters exclusive to the superuser (say: fat body mass) so there is not a 1-1 correspondence
the numerical parameters have other additional attributes that can be modified by the superuser (for example: limit weight)
the superuser supposedly should be able to add "notes" for some parameters
My confusion stems from the fact that I have no background in OOP programming and i can't find any similar examples online. I just need an input towards the right direction to go to. Is the pictured diagram correct? And why it most likely isn't? The problem as of now would be how to implement the fact that the superuser can also add notes to some parameters.
Do I:
create a single parameter class with the enumeration type as attribute which automatically implies the datatype of the input e.g weight = number?
create two subclasses for each User, e.g. UserParameters and SuperUserParameters, although some parameters overlap?
leave it as is with some adjustments?
other better approach?
I'd like to propose using an improved terminology. Since your app is about (health) property measurements, I'll replace your class name "Parameter" with Measurement.
The following model should satisfy all of your requirements (except the one discussed below):
Notice that the two subclasses UserProperty and SpecialProperty simply define a partitioning of Property. They can be eliminated by adding an enumeration attribute propertyCategory to the Property class, having USER_PPROPERTY and SPECIAL_PPROPERTY as its enum literals.
The only requirement, which is not yet covered, is
the numerical parameters have other additional attributes that can be
modified by the superuser (for example: limit weight)
This needs further carification. If these "other additional attributes" form a fixed set, then they can be modeled as further attributes of the Property class.
I don't think you should do that on UML level at all. You are going into memory management/overlays. And those are implementation details you should not take care of. Rather you are dealing with HeartRate and Weight as distinct objects. They will not have a common "value", which is just some memory allocation. They are what they are and whether you need a string or a number is some property of the distinct business objects.

Including only certain features when running deep feature synthesis?

For example one of my entities has two sets of IDs.
One that is continuous (which apparently is necessary to create the EntitySet), and one to use as a foreign key when merging with my other table.
This results in featuretools including the ID in the set of features to aggregate. SUM(ID) isn't a feature I am interested in though.
Is there a way to include certain feature when running deep feature synthesis?
There are three ways to exclude features when calling ft.dfs.
Use the ignore_variables to specify variables in an entity that should not be used to create features. It is a dictionary mapping an entity id to a list of variable names to ignore.
Use drop_contains to drop features that contain any of the strings
listed in this parameter.
Use drop_exact to drop features that exactly match any of the strings listed in this parameter.
Here is a example usage of all three in a ft.dfs call
ft.dfs(target_entity="customers"],
ignore_variables={
"transactions": ["amount"],
"customers": ["age", "gender", "date_of_birth"]
}, # ignore these variables
drop_contains=["customers.SUM("], # drop features that contain these strings
drop_exact=["STD(transactions.quanity)"], # drop features named exactly this
...
)
These 3 parameters are all documented here.
The final thing to consider if you are getting features you don't want is the variable types of the variables in your entity set. If you are seeing the sum of an ID variable that must mean that featuretools thinks the ID variable is a numeric value. If you tell featuretools it is an ID it will not apply a numeric aggregation to it.

Uml class diagram alternative of map attribute

Through searching I found that in UML, aggregation(assuming that it's used properly) can be used to represent attributes in a class.
For example,
(Assume column can stand alone)
Then, using such example, if I want to replace the attribute: Column[] with map to represent the column's name, would it be correct to use an association class just like below? (In case, I'm not willing to put the column name in Column class as an attribute)
Association classes are used with simple associations. They have m-1-1-n multiplicity. The shared aggregation (as you used) has no defined semantics (and I recommend to simply not use it unless you have a domain specific and documented use for it). It's simply better to put the intended multiplicity on either side of an association.
An association class connects two classes, adding attributes and/or operations. Your example is "unconventional" since Table/Column have a simple relation which would not need an association class. A general example is the Student/Lecture relation where you can put an association class in between to record exam results, times etc.
Yes, I think that is a valid way of modelling the fact that you have some sort of key string that can be used to identify a Column of this Table.
Using a Map is one if the many possible implementations, so it's not a real equal to.
The advantage of modelling using the Association Class is that your model remains at the more abstract functional level and leaves out implementation details.
BTW. I would use a composition instead of an aggregation for the association between Table and Column, as there is an obvious strong ownership relation and life-cycle dependency between the two.
If you want to replace the attribute Column[] with map to represent the column's name and you are not willing to put the column name in Column class as an attribute and assuming that you want to follow UML specification precisely then you'll produce the model shown below:
Map<Key, Value> is usually understood as an associative container that contains key-value pairs Map.Entry<Key, Value> with unique keys. The container is modeled by the directional aggregation from Map<Key, Value> to Map.Entry<Key, Value>.
Map<Key, Value> and Map.Entry<Key, Value> are templates. Clause 7.3.3.1 of UML specification says that:
A template cannot be used in the same manner as a non-template Element of the same kind. The template Element can only be used to generate bound Elements or as part of the specification of another template.
According to clause 7.3.3.3:
A TemplateBinding is a relationship between a TemplateableElement and a template that specifies the substitutions of actual ParameterableElements for the formal TemplateParameters of the template.
Thus we have two bound elements that have TemplateBinding (marked by <> keyword) relationships with their templates:
ColumnNames which essentially is the name for Map<String, Column>
ColumnName which essentially is the name for Map.Entry<String, Column
According to clause 11.5.1 of the UML specification:
An Association classifies a set of tuples representing links between typed instances. An AssociationClass is both an Association and a Class.
ColumnName is AssociationClass representing the link between instances of String and Column classes. We use notation from clause 11.5.5, figure 11.35 to express that.
Finally, the directional composition association between Table and ColumnNames classes tells us that each instance of Table owns an instance of ColumnNames, i.e. set of column names.
Note that while ColumnNames and ColumName classes are usually hidden from the end-user by an implementation, they nevertheless exist.
I used BoUML to draw the diagram.

ID3 Implementation Clarification

I am trying to implement the ID3 algorithm, and am looking at the pseudo-code:
(Source)
I am confused by the bit where it says:
If examples_vi is empty, create a leaf node with label = most common value in TargegetAttribute in Examples.
Unless I am missing out on something, shouldn't this be the most common class?
That is, if we cannot split the data on an attribute value because no sample takes that value for the particular attribute, then we take the most common class among all samples and use that?
Also, isn't this just as good as picking a random class?
The training set tells us nothing about the relation between the attribute value and the class labels...
1) Unless I am missing out on something, shouldn't this be the most
common class?
You're correct, and the text also says the same. Look at the function description at the top :
Target_Attribute is the attribute whose value is to be predicted by the tree
so the value of Target_Attribute is the class/label.
2) That is, if we cannot split the data on an attribute value because no sample takes that value for the particular attribute, then we take the most common class among all samples and use that?
Yes, but not among all samples in your whole dataset, but rather those samples that reached up to this point in the tree/recursion. (ID3 functions is recursive and so the current Examples is actually Examples_vi of the caller)
3) Also, isn't this just as good as picking a random class?
The training set tells us nothing about the relation between the attribute value and the class labels...
No, picking a random class (with equal chances for each class) is not the same. Because often the inputs do have an unbalanced class distribution (this distribution is often called the prior distribution in many texts), so you may have 99% of examples are positive and only 1% negative. So whenever you really have no information whatsoever to decide on the outcome of some input, it makes sense to predict the most probable class, so that you have the most probability of being correct. This maximizes your classifier's accuracy on unseen data only under the assumption that the class distribution in your training data is the same as in the unseen data.
This explanation holds with the same reasoning for the base case when Attributes is empty (see 4 line in your pseudocode text); whenever we have no information, we just report the most common class of the data at hand.
If you never implemented the codes(ID3) but still want to know more in processing details, I suggest you to read this paper:
Building Decision Trees in Python
and here is the source code from the paper:
decision tree source code
This paper has a example or use example from your book(replace the "data" file with the same format). And you can just debug it (with some breakpoints) in eclipse to check the attribute values during the algorithms running.
Go over it, you will understand ID3 better.

Is there a way to specify a relationship whereby a class generates the code for another class? UML

I am writing a system which generates code for a number of classes and I need to document it with a UML diagram. The classes will follow the same structure but they will have names set by the user. Is there a way to specify that CCodeGenerator generates the code for these classes?
Also, I currently have a relationship between my CDataDefinition class (which defines what should be included in each of the generated classes) and the CCodeGenerator, is there a way to denote that the multiplicity of the relationship between the generated classes and the generator is exactly equal to the number of CDataDefinition instances?
These classes will be used in another system which will also need UML class diagrams made for it. Is there a way to show that a class in this project (CEditior) uses them?
Example of operation:
I have 3 CDataDefinition objects which define classes X, Y, and Z. My CCodeGenerator instance will create 3 classes (C# code in .cs files) from these.
CEditor in a separate solution will then interface with these 3 classes.
If you read some of the introductory information on MOF, you will see that in the UML family an instance of a metaclass in one layer is a classifier in the next.
In your case, a class in the code generator describing the class in its output will be a metaclass (CDataGenerator), and the classes in the output represented by instances of the metaclass.
There is no way in plain UML for associations other than 'X is of type Y' to cross between the layers.
You may be able to model such a relationship using MOV QVT (query, view, transform - i.e. a language for mapping one model to another), but I don't know current state of tool support for that, and if you had a QVT tool you probably wouldn't need to be writing a code generator.
You need to build a template class (CDataDefinition) that will represent the structure of a class that can be created by CCodeGeneratorWhen you're creating actual class you do the binding so all you have to do is show that CCodeGenerator has an operation (let's say) classGenerator(name:String) and then you can show that this method creates a class as a proper binding on CDataDefinition.

Resources