How do I classify this value using a decision tree - decision-tree

Basically my decision tree can't classify a value using the normal algorithm.
I get to a node, and there are two options (say, sunny and windy), but at this node my value is different (for example, rainy).
Are there any methods to deal with this, e.g. change the tree or just estimate based on other data?
I was thinking of assigning the most common value at that node but this is just a guess.

Have you considered fuzzy logic for the rich/poor continuum? As for things that can't be expressed as a continuum, I can't think of a way it can be done. Rainy weather, for example, is so fundamentally different from sunny and windy weather in how we experience and react to it, I'm not sure how you expect a computer (or whatever it is you're writing your decision tree for) to figure out what to do. (Aside from simply having an "I don't know what to do" output state, but I'm assuming you wanted something more meaningful than that.)

The whole point of decision trees is that the options are complete and (hopefully) mutually exclusive.
If they are not, you'll get into trouble. Redefine poor and rich to cover everything (all incomes, all states of mind...).
But honestly, interpret such weather examples as what they are: just examples for a concept, not the holy grail of meteorology.

The issue here is that you learned the decision tree from different data than you are now using it to classify. More specifically, your decision tree knows only two values (i.e., sunny and windy) for the attribute Weather, but your data for classification also allows the value rainy.
Since your decision tree has no observations where the weather was rainy, this value is useless to it. In other words, you have to eliminate this value from the data you classify.
The only solution is to clean the data before using the decision tree as a classifier.
You have the following options:
1. Remove all observations/instances with Weather="rainy" from your data set because you can't classify them. The disadvantage is that none of the instances with Weather="rainy" get classified.
2. For all observations/instances with Weather="rainy", remove the value, or rather set it to unknown/null. If your decision tree can handle null values, it can then classify your whole data set. If not, you still have a problem. In that case you should go for option 3.
3. Relearn your decision tree with Weather={sunny, windy, rainy}.
(4). In your case the following is not an option: replace "rainy" with either "sunny" or "windy". There are different heuristics for that.
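To make option 2 (and the asker's own idea of using the most common value at the node) concrete, here is a minimal sketch. It assumes a hand-built multiway tree; the `Node` class, its fields and the "play"/"stay" labels are hypothetical and not taken from any library. When the instance's value has no matching branch, classification stops and returns the majority class of the training examples that reached that node.

```python
from collections import Counter


class Node:
    """Hypothetical decision-tree node, for illustration only."""

    def __init__(self, attribute=None, children=None, class_counts=None):
        self.attribute = attribute        # attribute tested here; None for a leaf
        self.children = children or {}    # attribute value -> child Node
        # Labels of the training examples that reached this node.
        self.class_counts = class_counts or Counter()

    def majority_class(self):
        return self.class_counts.most_common(1)[0][0]


def classify(node, instance):
    """Walk down the tree; fall back to the majority class on unseen values."""
    while node.attribute is not None:
        value = instance.get(node.attribute)
        child = node.children.get(value)
        if child is None:
            # Value is unknown/null or was never seen in training (e.g. "rainy").
            return node.majority_class()
        node = child
    return node.majority_class()


# Toy tree learned from data that only contained "sunny" and "windy".
root = Node(
    attribute="Weather",
    children={
        "sunny": Node(class_counts=Counter({"play": 3})),
        "windy": Node(class_counts=Counter({"stay": 2})),
    },
    class_counts=Counter({"play": 3, "stay": 2}),
)

print(classify(root, {"Weather": "sunny"}))
print(classify(root, {"Weather": "rainy"}))  # falls back to the root's majority class
```

Whether the majority class is a sensible answer depends on your data; relearning the tree (option 3) remains the cleaner fix.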

You are talking about the "normal algorithm", which is a rather vague statement. I assume you are using a strictly binary rooted decision tree, where each internal node makes a binary split of the data. Thus, the condition evaluated at each internal node outputs a Boolean value, which sends the data to the left node (true) or the right node (false). In your case, you have a categorical variable weather with two possible values in the training data, which allows only two possible node conditions: weather==sunny or weather==windy. Hence, the rainy samples will always end up in the right node, as they are neither sunny nor windy.
In the following picture, the rainy samples will be classified as not sunny, not windy.
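The same behaviour written as code, as a rough sketch: the tree layout and the leaf names are assumptions made up for illustration, not something the asker posted. Every test is a yes/no question, so a value the tree has never seen simply answers "no" each time.

```python
def classify_binary(instance):
    """Hypothetical binary tree with the splits weather==sunny and weather==windy."""
    if instance["weather"] == "sunny":    # left branch (true)
        return "sunny leaf"
    # right branch (false): everything that is not sunny
    if instance["weather"] == "windy":    # left branch (true)
        return "windy leaf"
    # right branch (false): not sunny and not windy, e.g. rainy
    return "not sunny, not windy leaf"


for weather in ("sunny", "windy", "rainy"):
    print(weather, "->", classify_binary({"weather": weather}))
```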


Hypothesis search tree

I have an object with many fields. Each field has a different range of values. I want to use Hypothesis to generate different instances of this object.
Is there a limit to the number of combinations of field values Hypothesis can handle? Or what does the search tree that Hypothesis creates look like? I don't need all the combinations, but I want to make sure that I get a fair number of combinations where I test many different values for each field. I want to make sure Hypothesis is not doing a DFS until it hits the maximum number of examples to generate.
TLDR: don't worry, this is a common use-case and even a naive strategy works very well.
The actual search process used by Hypothesis is complicated (as in, "lead author's PhD topic"), but it's definitely not a depth-first search! Briefly, it's a uniform distribution layered on a pseudo-random number generator, with a coverage-guided fuzzer biasing that towards less-explored code paths, with strategy-specific heuristics on top of that.
In general, I trust this process to pick good examples far more than I trust my own judgement, or that of anyone without years of experience in QA or testing research!

Embeddings vs text cleaning (NLP)

I am a graduate student focusing on ML and NLP. I have a lot of data (8 million lines), and the text is usually badly written and contains many spelling mistakes.
So I must go through some text cleaning and vectorizing. To do so, I considered two approaches:
First one:
cleaning text by replacing bad words using hunspell package which is a spell checker and morphological analyzer
+
tokenization
+
convert sentences to vectors using tf-idf
The problem here is that Hunspell sometimes fails to provide the correct word and replaces the misspelled word with another word that doesn't have the same meaning. Furthermore, Hunspell does not recognize acronyms or abbreviations (which are very important in my case) and tends to replace them.
Second approach:
tokenization
+
using some embedding method (like word2vec) to convert words into vectors without cleaning the text
I need to know if there is some (theoretical or empirical) way to compare these two approaches :)
Please do not hesitate to respond if you have any ideas to share; I'd love to discuss them with you.
Thank you in advance
I am posting this here just to summarise the comments in a longer form and give you a bit more commentary. Not sure it will answer your question. If anything, it should show you why you should reconsider it.
Points about your question
Before I talk about your question, let me point out a few things about your approaches. Word embeddings are essentially mathematical representations of meaning based on word distribution. They are the epitome of the phrase "You shall know a word by the company it keeps". In this sense, you will need very regular misspellings in order to get something useful out of a vector space approach. Something that could work, for example, is US vs. UK spelling, or shorthands like w8 vs. full forms like wait.
Another point I want to make clear (or perhaps you should do that) is that you are not looking to build a machine learning model here. You could consider the word embeddings that you could generate, a sort of a machine learning model but it's not. It's just a way of representing words with numbers.
You already have the answer to your question
You yourself have pointed out that using Hunspell introduces new mistakes. That will no doubt also be the case with your other approach. If this is just a preprocessing step, I suggest you leave it at that. It is not something you need to prove. If for some reason you do want to dig into the problem, you could evaluate the effects of your methods through an external task, as @lenz suggested.
How does external evaluation work?
When a task is too difficult to evaluate directly we use another task which is dependent on its output to draw conclusions about its success. In your case, it seems that you should pick a task that depends on individual words like document classification. Let's say that you have some sort of labels associated with your documents, say topics or types of news. Predicting these labels could be a legitimate way of evaluating the efficiency of your approaches. It is also a chance for you to see if they do more harm than good by comparing to the baseline of "dirty" data. Remember that it's about relative differences and the actual performance of the task is of no importance.

Union find in python3

I know how to implement union find in general, but I was thinking of whether there would be a way to utilize the set structure in python to achieve the same result.
For example, we can union sets pretty easily. But I'm not sure how to determine if two elements are in the same set using just sets.
So, I am wondering if there is a data structure in python that would support such operation, other than the usual implementation?
You could always solve this problem by visualizing it as a set of trees whose nodes connect to each other via the root, and then climbing up the tree if you want to know whether two nodes are connected. If the two nodes you are comparing have the same root (they are in the same tree), then they are connected.
To connect two nodes, just go to the root of each tree they are in, and make one root become the parent of the other.
This video will give you a great intuition about it:
https://www.youtube.com/watch?v=YIFWCpquoS8&list=PLUX6FBiUa2g4YWs6HkkCpXL6ru02i7y3Q&index=1
The connection between the tree nodes can be made via pointers in a language which supports them, but if your language doesn't (Python), then you can create your own pointers by storing positions and links in an array.
The array would be such that its positions represent your nodes, and the values inside it represent each node's link towards its root. In the beginning, each position in the array is filled with its own node number because the nodes initially have no parent, but as you connect nodes, the roots change, and the array has to reflect this. The value stored at a position is the identifier of that node's parent; a root stores its own identifier.
But try visualizing the problem first instead of thinking of arrays and too many mathematical artifacts. Dealing with it visually makes the solution seem trivial, and can be a good guide while writing code.
I say this because I watched the video from Robert Sedgewick I just posted, with a graphical simulation of the solution, and implemented it myself without paying too much attention to the code in his book. The intuition the video gave me is much more valuable than any mathematics.
It will help you to encapsulate the nodes into a class, with the following methods:
climbTreeFromNodeUpToRoot
setNewParentToThisNodeAndUpdateHeights
The first method, as the name says, takes you from a node and goes up the tree until finding the root of it, which is then returned.
If you compare two nodes with this method (actually, the roots returned by it), you know easily if they are connected by just comparing their roots.
Once you want to connect them, you go up the trees of both nodes, and ask one root to take the other one as its parent.
The trees can grow very tall (sorry, I don't use the official nomenclature, but this is the one that makes sense to me), so this simple approach will get very slow when you have to climb the tree at a later time.
To prevent trees from becoming too tall, don't just set one root as the parent of another without any criterion, but attach the shorter tree (in terms of height, not number of elements) to the taller one.
For this, you need to know the height of each tree, and you can store this information at its root (via an extra array in your case, or an extra field on each node in other languages). This information should be updated every time another tree connects to it.
A tree cannot know by itself that it just got a new tree attached to it, so it's important that every tree attaching to a second one informs the second to update its height.
This information can be sent to the root of the second tree, and later used to judge (as written before) which tree is the shorter. Remember, attaching a small tree to a big one instead of the opposite will save you incredible amounts of time.
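Here is how the description above could look in Python, as a minimal sketch rather than a reference implementation. The class and method names (`UnionFind`, `find`, `union`, `connected`) are my own choices; nodes are assumed to be the integers 0..n-1, `parent` is the array of links and `height` is the extra array that stores each root's tree height.

```python
class UnionFind:
    def __init__(self, n):
        # Every node starts as the root of its own one-node tree.
        self.parent = list(range(n))
        self.height = [0] * n   # only meaningful for roots

    def find(self, node):
        """Climb from a node up to the root of its tree."""
        while self.parent[node] != node:
            node = self.parent[node]
        return node

    def connected(self, a, b):
        """Two nodes are connected when they share the same root."""
        return self.find(a) == self.find(b)

    def union(self, a, b):
        """Attach the shorter tree under the root of the taller one."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        if self.height[root_a] < self.height[root_b]:
            root_a, root_b = root_b, root_a
        # root_a is now the taller (or equally tall) tree.
        if self.height[root_a] == self.height[root_b]:
            self.height[root_a] += 1   # equal heights: the result is one level taller
        self.parent[root_b] = root_a


uf = UnionFind(5)
uf.union(0, 1)
uf.union(3, 4)
print(uf.connected(0, 1), uf.connected(1, 3))
uf.union(1, 4)
print(uf.connected(0, 3))
```

If your elements are not small integers, one option is to replace the two lists with dictionaries keyed by element.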
Do you want something like this?
myset = ...
all(elt in myset for elt in (a,b))

Introductory reading on classifiers that are not "yes/no" naive Bayes

I want to manually implement a classifier for certain short strings of words, getting a "goodness" rank for each of them. I have made a naive Bayesian classifier which is basically spam-filter-like and scores strings based on previous "good"/"bad" ratings. So far so good.
Now, there are two problems that I want to solve (by properly understanding things)...
The question is: what would be good introductory material for the problems below, not of the "cookbook" variety but more systematic, and yet ideally shorter than a university statistics course :) A set of articles that is shorter than a book, or a good book. Ideally aimed at programmers.
The problems are:
first, in my system there are actually 3 types of user feedback - "good", "bad", and "neutral". Most items are neutral, and right now I simply don't include them in the ranking. I am wondering how these things are properly handled (I still need to obtain a single "goodness probability" per item, so if I calculate probability of good and bad separately, are there any pitfalls/proper methods to combining those).
Then, I want to remove the naive part from my classifier (i.e. take relations between words into account), so some different classifier may be in order. Or, I could add all pairs-triples-etc. of words as features, since the strings are short - this feels like a hack, but then again my CS/maths background is rusty enough and/or insufficient to say whether this is a valid technique.
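For reference, the "add all pairs-triples-etc. of words as features" idea mentioned above amounts to extracting word n-grams. A minimal sketch, where the function name `word_ngrams` and the lowercasing/whitespace tokenization are assumptions for illustration:

```python
def word_ngrams(text, max_n=3):
    """Return all word n-grams of length 1..max_n, in order of increasing n."""
    words = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            features.append(" ".join(words[i:i + n]))
    return features


print(word_ngrams("cheap red shoes", max_n=2))
# ['cheap', 'red', 'shoes', 'cheap red', 'red shoes']
```

Each n-gram would then be counted by the existing classifier exactly like a single word is counted today.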

String Matching Algorithms

I have a python app with a database of businesses and I want to be able to search for businesses by name (for autocomplete purposes).
For example, consider the names "best buy", "mcdonalds", "sony" and "apple".
I would like "app" to return "apple", as well as "appel" and "ple".
"Mc'donalds" should return "mcdonalds".
"bst b" and "best-buy" should both return "best buy".
Which algorithm am I looking for, and does it have a python implementation?
Thanks!
The Levenshtein distance should do.
Look around - there are implementations in many languages.
Levenshtein distance will do this.
Note: this is a distance, you have to calculate it to every string in your database, which can be a big problem if you have a lot of entries.
If you have this problem, then record all the typos the users make (typo = no direct match) and build a correction database offline which contains all the typo->fix mappings. Some companies do this even more cleverly, e.g. Google watches how users correct their own typos and learns the mappings from this.
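For illustration, a plain dynamic-programming Levenshtein distance plus the brute-force scan over every name that this answer warns about. This is a minimal sketch: the function names are mine and `NAMES` is just the example list from the question.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))      # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


NAMES = ["best buy", "mcdonalds", "sony", "apple"]


def closest(query, names=NAMES):
    """Linear scan: one distance computation per database entry."""
    return min(names, key=lambda name: levenshtein(query.lower(), name))


print(levenshtein("appel", "apple"))   # 2
print(closest("Mc'donalds"))           # mcdonalds
```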
Soundex or Metaphone might work.
I think what you are looking for falls under the huge field of Data Quality and Data Cleansing. I doubt you will find a Python implementation for this, as it would have to be something that cleanses a considerable amount of data in a database, which could be of business value.
Levenshtein distance goes in the right direction, but only half the way. There are several tricks to get it to handle partial matches as well.
One would be to use subsequence dynamic time warping (DTW is actually a generalization of Levenshtein distance). For this you relax the start and end conditions when calculating the cost matrix. If you only relax one of the conditions you can get autocompletion with spell checking. I am not sure if there is a Python implementation available, but if you want to implement it yourself it should not be more than 10-20 LOC.
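A sketch of what relaxing only the end condition could look like, written as an ordinary edit-distance table rather than full DTW (that simplification is my assumption, as is the name `prefix_distance`). The whole query must be consumed, but the candidate may stop at any prefix, so the result is the minimum of the last row instead of its last cell.

```python
def prefix_distance(query, candidate):
    """Smallest edit distance between query and any prefix of candidate."""
    previous = list(range(len(candidate) + 1))
    for i, cq in enumerate(query, start=1):
        current = [i]
        for j, cc in enumerate(candidate, start=1):
            cost = 0 if cq == cc else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    # Relaxed end condition: the candidate does not have to be fully consumed.
    return min(previous)


print(prefix_distance("app", "apple"))       # 0: exact prefix
print(prefix_distance("bst b", "best buy"))  # 1: one missing letter
```

A query like "ple" is not a prefix, so it would additionally need the start condition relaxed.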
The other idea would be to use a trie for a speedup, which can do DTW/Levenshtein on multiple results simultaneously (a huge speedup if your database is large). There is an IEEE paper on Levenshtein on tries, so you can find the algorithm there. Again, for this you would need to relax the final boundary condition, so you get partial matches. However, since you step down in the trie, you just need to check when you have fully consumed the input and then return all leaves.
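The trie idea can be sketched roughly as follows. This is my own reading of the description, not the algorithm from the paper mentioned above, and all names (`TrieNode`, `build_trie`, `search`, `max_cost`) are hypothetical. One row of the edit-distance table is computed per trie node and shared by every name below it; as soon as the full query is matched within `max_cost`, everything under that node is returned as a completion.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.word = None     # set when a stored name ends at this node


def build_trie(names):
    root = TrieNode()
    for name in names:
        node = root
        for ch in name:
            node = node.children.setdefault(ch, TrieNode())
        node.word = name
    return root


def collect(node, out):
    """Add every name stored at or below node to the set out."""
    if node.word is not None:
        out.add(node.word)
    for child in node.children.values():
        collect(child, out)


def search(root, query, max_cost):
    """Names having some prefix within max_cost edits of query."""
    results = set()
    first_row = list(range(len(query) + 1))
    if first_row[-1] <= max_cost:
        collect(root, results)   # query is so short that the empty prefix matches
        return results

    def walk(node, ch, prev_row):
        row = [prev_row[0] + 1]
        for j, cq in enumerate(query, start=1):
            cost = 0 if cq == ch else 1
            row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
        if row[-1] <= max_cost:
            collect(node, results)   # input fully consumed: return all leaves below
        elif min(row) <= max_cost:   # otherwise prune hopeless branches
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return results


trie = build_trie(["best buy", "mcdonalds", "sony", "apple"])
print(search(trie, "app", 0))
print(search(trie, "bst b", 1))
```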
Check this one: http://docs.python.org/library/difflib.html
It should help you.
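A minimal sketch of what using difflib for this could look like. `difflib.get_close_matches` is part of the standard library; the wrapper `suggest`, the `NAMES` list and the chosen `n` and `cutoff` values are assumptions for illustration.

```python
import difflib

NAMES = ["best buy", "mcdonalds", "sony", "apple"]


def suggest(query, names=NAMES, n=3, cutoff=0.6):
    """Return up to n names whose similarity ratio to query is at least cutoff."""
    return difflib.get_close_matches(query.lower(), names, n=n, cutoff=cutoff)


for query in ("appel", "Mc'donalds", "best-buy"):
    print(query, "->", suggest(query))
```

Very short or partial queries may fall below the cutoff, so you would probably want to tune `cutoff` against real search input.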
