I have a data set and I am doing classification using the Weka NaiveBayes classifier. I have 14 attributes, some of which are nominal.
Only one of these attributes has missing values. What I have done so far is leave them as missing, and I know that Weka can replace those values automatically (a question was asked here about that).
That is, the values for this attribute are empty in my feature file, so when I create the ARFF file I see "?" between the two commas.
Now, I have two possibilities:
1) Let them be filled by Weka automatically.
2) Replace them by "NULL".
The problem is that the classifier performs better in the first case. Now I am wondering: is it acceptable to let the missing values be replaced by Weka, or should I use the second approach even though I get worse results?
In other words, when should we let Weka replace missing values, and when not?
For context, the feature that has missing values represents the WordNet supersense of the word, and when it is empty, it means that the instance is, for example, a preposition or a WH question.
Thanks in advance,
About missing values: Weka doesn't replace them by default; you have to use a filter (exactly as in the post you linked first in your question). Some classifiers can handle missing values themselves; Naive Bayes, I believe, simply leaves them out of the probability calculation. So basically you have three options:
1) Use the ReplaceMissingValues filter to replace missing values with the mode (for nominal attributes) or mean (for numeric ones).
2) Don't use a filter and keep the dataset with missing values. In this case I recommend you look at how Naive Bayes handles them, to understand how your missing values will be treated and whether that suits you.
3) Replace the missing values with your own label, such as "other values".
The key to the right choice is probably in your last paragraph, which suggests that your missing values actually mean something. If that is so, I would use the third approach, your own label. On the other hand, if the missing values don't mean anything and are just the result of some fault in data collection, I would consider the first two approaches. Good luck.
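For concreteness, a minimal sketch of option 1) using Weka's Java API, assuming the ARFF file is named "features.arff" and the class is the last attribute (both are assumptions, adjust them to your data):

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.ReplaceMissingValues;

    public class ReplaceMissingDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("features.arff");  // placeholder file name
            data.setClassIndex(data.numAttributes() - 1);        // assumes the class is the last attribute

            // Option 1: replace "?" with the mode (nominal) / mean (numeric) of each attribute.
            ReplaceMissingValues filter = new ReplaceMissingValues();
            filter.setInputFormat(data);
            Instances filled = Filter.useFilter(data, filter);

            // Evaluate Naive Bayes on the filtered data with 10-fold cross-validation.
            Evaluation eval = new Evaluation(filled);
            eval.crossValidateModel(new NaiveBayes(), filled, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }

For option 3) you would instead edit the feature file (or use an AddValues/merge filter) so that the attribute gets an explicit label such as "NULL" before training.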
I would like to know what kind of technique in the Machine Learning domain can solve the problem below. (For example: classification, CNN, RNN, etc.)
Problem Description:
User would input a string, and I would like to decompose the string to get the information I want. For example:
The user inputs "R21TCCCUSISS", and after decomposing the code I get: "R21" is the product type, "TCC" is the batch number, and "CUSISS" is the place of origin.
The user inputs "TT3SUAWXCCAT", and after decomposing the code I get: "TT3S" is the product type, "SUAW" is the batch number, "X" is a wrong character the user typed in, and "CCAT" is the place of origin.
The product type, batch number, and place of origin do not have fixed lengths. For example, the product type may be "R21" or "TT3S", meaning it may comprise 3 or 4 characters.
Also, the string may sometimes contain wrong input, like the "X" in example 2 above.
I've tried to find a related solution, and the closest I found is this one: https://github.com/philipperemy/Stanford-NER-Python
However, the strings I have are not sentences. A sentence has spaces and grammar, but my strings don't fit that situation.
Your problem is not reasonably solved with any ML, since you have a defined list of product types etc., since there may not be any actual simple logic behind the codes, and since you are typically not working in a continuum (a vector space, etc.). The purpose of ML is to build a regression function from a few pieces of data and hope/expect good generalisation (the regression fits all the unseen examples, past, present and future).
Basically you are trying to reverse-engineer the input grammar and generation (which was done by an algorithm, possibly including a random number generator). But in order to assert that your classifier function is working properly, you would need all your data to also be ground truth, which breaks the ML principle.
You want to list all your defined product types (ground truth) and scatter the pieces of your input (with or without a regex pattern) into the different fields (batch number, place of origin). The "learning" is actually building a function (or a few, one per field), element by element, which amounts to filling a map (C++) or a dictionary (C#) and using it to parse the input.
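A minimal sketch of that map-based approach in Java, assuming hypothetical vocabularies for the three fields (the real lists would come from your catalogue) and using greedy longest-prefix matching while skipping characters that match nothing; truly noisy codes like the second example would need fuzzier matching on top of this:

    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class CodeDecomposer {
        // Hypothetical ground-truth vocabularies; replace with your real lists.
        static final List<String> PRODUCT_TYPES = Arrays.asList("R21", "TT3S");
        static final List<String> BATCH_NUMBERS = Arrays.asList("TCC", "SUAW");
        static final List<String> ORIGINS = Arrays.asList("CUSISS", "CCAT");

        // Return the longest vocabulary entry that matches at position 'pos', or "" if none does.
        static String matchAt(String code, int pos, List<String> vocab) {
            String best = "";
            for (String v : vocab) {
                if (code.startsWith(v, pos) && v.length() > best.length()) {
                    best = v;
                }
            }
            return best;
        }

        public static Map<String, String> decompose(String code) {
            Map<String, String> fields = new LinkedHashMap<>();
            int pos = 0;
            String type = matchAt(code, pos, PRODUCT_TYPES);
            fields.put("productType", type);
            pos += type.length();
            String batch = matchAt(code, pos, BATCH_NUMBERS);
            fields.put("batchNumber", batch);
            pos += batch.length();
            // Skip stray characters (like the "X" in example 2) until an origin matches.
            while (pos < code.length() && matchAt(code, pos, ORIGINS).isEmpty()) {
                pos++;
            }
            fields.put("placeOfOrigin", matchAt(code, pos, ORIGINS));
            return fields;
        }

        public static void main(String[] args) {
            // Prints {productType=R21, batchNumber=TCC, placeOfOrigin=CUSISS}
            System.out.println(decompose("R21TCCCUSISS"));
        }
    }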
I am trying to implement the ID3 algorithm, and am looking at the pseudo-code:
(Source)
I am confused by the bit where it says:
If examples_vi is empty, create a leaf node with label = most common value of TargetAttribute in Examples.
Unless I am missing out on something, shouldn't this be the most common class?
That is, if we cannot split the data on an attribute value because no sample takes that value for the particular attribute, then we take the most common class among all samples and use that?
Also, isn't this just as good as picking a random class?
The training set tells us nothing about the relation between the attribute value and the class labels...
1) Unless I am missing out on something, shouldn't this be the most common class?
You're correct, and the text also says the same. Look at the function description at the top:
Target_Attribute is the attribute whose value is to be predicted by the tree
so the value of Target_Attribute is the class/label.
2) That is, if we cannot split the data on an attribute value because no sample takes that value for the particular attribute, then we take the most common class among all samples and use that?
Yes, but not among all samples in your whole dataset; rather, among those samples that reached this point in the tree/recursion. (The ID3 function is recursive, so the current Examples is actually the Examples_vi of the caller.)
3) Also, isn't this just as good as picking a random class?
The training set tells us nothing about the relation between the attribute value and the class labels...
No, picking a random class (with equal chances for each class) is not the same, because inputs often have an unbalanced class distribution (often called the prior distribution in many texts): you may have 99% positive examples and only 1% negative. So whenever you really have no information whatsoever to decide the outcome for some input, it makes sense to predict the most probable class, so that you have the highest probability of being correct. Note that this maximizes your classifier's accuracy on unseen data only under the assumption that the class distribution in your training data is the same as in the unseen data.
The same reasoning holds for the base case when Attributes is empty (see line 4 in your pseudocode text): whenever we have no information, we just report the most common class of the data at hand.
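A minimal sketch of that fallback, assuming the examples that reached the node are represented simply as a list of class labels:

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MajorityClass {
        // Fallback used when Examples_vi (or Attributes) is empty: return the most
        // common class label among the examples that reached this node.
        static String majorityClass(List<String> labels) {
            Map<String, Integer> counts = new HashMap<>();
            for (String label : labels) {
                counts.merge(label, 1, Integer::sum);
            }
            return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
        }

        public static void main(String[] args) {
            // 3 positives vs. 1 negative at this node -> predict "positive".
            System.out.println(majorityClass(Arrays.asList("positive", "negative", "positive", "positive")));
        }
    }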
If you have never implemented ID3 but still want to know more about the processing details, I suggest you read this paper:
Building Decision Trees in Python
and here is the source code from the paper:
decision tree source code
This paper has an example, or you can use the example from your book (replace the "data" file with one in the same format). You can then debug it (with some breakpoints) in Eclipse to check the attribute values while the algorithm runs.
Go over it and you will understand ID3 better.
I need to cluster a large number of strings using ELKI, based on the edit distance / Levenshtein distance. Since the data set is too large, I'd like to avoid file-based precomputed distance matrices. How can I
(a) load string data in ELKI from a file (only "Labels")?
(b) implement a distance function accessing the labels (extend AbstractDBIDDistanceFunction, but how to get the labels?)
Some code snippets or example input files would be helpful.
It's actually pretty straightforward:
A) Write a Parser that is adequate for your input file format (why try to reuse a parser written for numerical vectors with labels?), probably subclassing AbstractStreamingParser, producing a relation of the desired data type. Probably you can just use String; if you want to be a bit more general, a TokenSequence may be a more appropriate concept for these distances. Strings are just the simplest case.
B) Implement a DistanceFunction based on this data type instead of DBIDs, i.e. a PrimitiveDistanceFunction<String>. Again, subclassing AbstractPrimitiveDistanceFunction may be the easiest thing to do.
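The distance computation itself is independent of ELKI; here is a plain Java sketch of Levenshtein distance that you could wrap in such a PrimitiveDistanceFunction<String> (the exact ELKI class and package names depend on your version, so treat the wiring as an assumption):

    public final class LevenshteinDistance {
        // Two-row dynamic programming: O(|a| * |b|) time, O(|b|) memory.
        public static int distance(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) {
                prev[j] = j;
            }
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                }
                int[] tmp = prev;
                prev = curr;
                curr = tmp;
            }
            return prev[b.length()];
        }

        public static void main(String[] args) {
            System.out.println(distance("kitten", "sitting")); // 3
        }
    }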
For performance reasons, you may also want to look into indexing algorithms to retrieve, e.g., the k most similar strings efficiently. I'm not sure which index structures exist for string edit distance and Levenshtein distance.
A colleague has a student who apparently has some working token edit distances, but I have not seen or reviewed the code yet. As he is processing log files, he will probably be using a token-based approach instead of characters.
Suppose I have a list of numbers and I've computed the q-quantile (using Quantile).
Now a new datapoint comes along and I want to update my q-quantile, without having stored the whole list of previous datapoints.
What would you recommend?
Perhaps it can't be done exactly without, in the worst case, storing all previous datapoints.
In that case, can you think of something that would work well enough?
One idea I had, if you can assume normality, is to use the inverse CDF instead of the q-quantile.
Keep track of the sample mean and variance as you go; then you can compute InverseCDF[NormalDistribution[sampleMean, Sqrt[sampleVariance]], q], which should be the value such that a fraction q of the values are smaller, which is what the q-quantile is.
(I see belisarius was thinking along the same lines; here's the link he pointed to: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm)
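A small sketch of that idea outside Mathematica, combining the online mean/variance update from the linked Wikipedia article (Welford's algorithm) with a normal inverse CDF; it assumes Apache Commons Math is on the classpath, and the last step corresponds to InverseCDF[NormalDistribution[mean, Sqrt[variance]], q]:

    import org.apache.commons.math3.distribution.NormalDistribution;

    public final class RunningQuantile {
        private long n = 0;
        private double mean = 0.0;
        private double m2 = 0.0;  // running sum of squared deviations from the mean

        // Welford's online update: constant memory per data point.
        public void add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        // Normality-based estimate; requires at least two distinct values so that sd > 0.
        public double estimateQuantile(double q) {
            double variance = (n > 1) ? m2 / (n - 1) : 0.0;
            return new NormalDistribution(mean, Math.sqrt(variance)).inverseCumulativeProbability(q);
        }

        public static void main(String[] args) {
            RunningQuantile rq = new RunningQuantile();
            for (double x : new double[] {1.2, 0.8, 1.5, 0.9, 1.1}) {
                rq.add(x);
            }
            System.out.println(rq.estimateQuantile(0.9));
        }
    }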
Unless you know that your underlying data comes from some distribution, it is not possible to update arbitrary quantiles without retaining the original data. You can, as others suggested, assume that the data has some sort of distribution and store the quantiles this way, but this is a rather restrictive approach.
Alternatively, have you thought of programming this somewhere besides Mathematica? For example, you could create a class for your data points that contains (1) the double value and (2) a timestamp for when the data came in. With a SortedList of these data-point objects (compared by value), you could get the quantile very fast by simply referencing an index into the list. Want a historical quantile? Simply filter on the timestamps in your sorted list.
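A minimal sketch of that exact-quantile bookkeeping (timestamps omitted for brevity); note that the quantile convention used below, index ceil(q*n), is only one of several, so adjust it to match Mathematica's Quantile if you need identical results:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public final class SortedQuantile {
        private final List<Double> values = new ArrayList<>();

        // Keep the list sorted on insertion: binary search to find the slot, O(n) shift to insert.
        public void add(double x) {
            int idx = Collections.binarySearch(values, x);
            if (idx < 0) {
                idx = -(idx + 1);
            }
            values.add(idx, x);
        }

        // Exact q-quantile by index lookup into the sorted list.
        public double quantile(double q) {
            int idx = (int) Math.ceil(q * values.size()) - 1;
            return values.get(Math.max(0, Math.min(idx, values.size() - 1)));
        }

        public static void main(String[] args) {
            SortedQuantile sq = new SortedQuantile();
            for (double x : new double[] {5, 1, 3, 2, 4}) {
                sq.add(x);
            }
            System.out.println(sq.quantile(0.5)); // 3.0
        }
    }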
I want to map strings (words) to numbers: the more similar the strings, the nearer their mapped values should be. The positional combination of the letters should also influence the mapping; the mapping function should depend on the letters, their positions (so that, for example, "pit" and "tip" map to different values), and the number of letters.
To give some examples: starter, stater, stapler, startler, tstarter. These words are of the format "(*optional)sta(*opt)*er", where * denotes some sort of variable, in our case either 't' or 'l' (as in starter and stapler). All of these should be mapped individually, without reference to the others, such that their values do not differ by much; later, when creating groups, I can assign an appropriate range of numbers to differentiate the groups.
So similar strings should map to similar values. There are many words, so comparing them with each other would be complex; instead I want to map each word to some numeric value independently, put similar strings (which would have similar values) into a group, and later find these patterns by other means.
So for now I need to look for existing mapping methods such that similar strings (I hope I have clarified the term 'similar' for my context) have similar values, and these values differ from those of dissimilar strings. Again, I emphasize that the number of strings is huge, and comparing each with every other is practically impossible (or computationally expensive and very slow). So what I think I need is to devise an algorithm (taking help from existing ones) for mapping each word (string) on its own.
Have I made myself clear? Please give me some ideas to start with: some terms to search and research.
I think I need some type of "bad" hash function to hash the strings and then put them into buckets according to that hash value. At least give me some ideas or algorithm names.
It seems it would be best to use a known algorithm like Levenshtein distance.
This search on StackOverflow reveals this question about finding-groups-of-similar-strings-in-a-large-set-of-strings, which links to this article describing SimHash, which sounds exactly like what you want.
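To make the SimHash idea concrete, here is a rough sketch over character trigrams (the choice of trigrams as features is an assumption; words, prefixes, or letter-position pairs would work too). Similar strings tend to get signatures with a small Hamming distance, so you can bucket by signature (or signature prefix) without comparing every pair:

    public final class SimHash {
        private static final int BITS = 64;

        public static long simhash(String s) {
            int[] acc = new int[BITS];
            // For each character trigram, spread its 32-bit hash over 64 bits (cheap mixing,
            // good enough for a sketch) and vote +1/-1 on every bit position.
            for (int i = 0; i + 3 <= s.length(); i++) {
                long h = s.substring(i, i + 3).hashCode() * 0x9E3779B97F4A7C15L;
                for (int b = 0; b < BITS; b++) {
                    acc[b] += ((h >>> b) & 1L) == 1L ? 1 : -1;
                }
            }
            // The signature bit is 1 wherever the votes are positive.
            long sig = 0L;
            for (int b = 0; b < BITS; b++) {
                if (acc[b] > 0) {
                    sig |= 1L << b;
                }
            }
            return sig;
        }

        public static int hammingDistance(long a, long b) {
            return Long.bitCount(a ^ b);
        }

        public static void main(String[] args) {
            long h1 = simhash("starter");
            long h2 = simhash("stater");
            long h3 = simhash("completelydifferent");
            System.out.println(hammingDistance(h1, h2)); // expected to be relatively small
            System.out.println(hammingDistance(h1, h3)); // expected to be larger
        }
    }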