Update the quantile for a dataset when a new datapoint is added - statistics

Suppose I have a list of numbers and I've computed the q-quantile (using Quantile).
Now a new datapoint comes along and I want to update my q-quantile, without having stored the whole list of previous datapoints.
What would you recommend?
Perhaps it can't be done exactly without, in the worst case, storing all previous datapoints.
In that case, can you think of something that would work well enough?

One idea I had, if you can assume normality, is to use the inverse CDF instead of the q-quantile.
Keep track of the sample mean and variance as you go; then you can compute InverseCDF[NormalDistribution[sampleMean, Sqrt[sampleVariance]], q], which should be the value such that a fraction q of the values are smaller, which is what the q-quantile is. (Note that NormalDistribution takes the standard deviation, not the variance, as its second argument.)
(I see belisarius was thinking along the same lines.
Here's the link he pointed to: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm )
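
Here is a minimal Python sketch of that idea, under the same normality assumption: Welford's online algorithm keeps a running mean and variance in O(1) memory, and the q-quantile estimate is the inverse CDF of the fitted normal (this is the Python equivalent of the InverseCDF expression above; the class and variable names are illustrative).

# Minimal sketch: online mean/variance (Welford) + normal inverse CDF as a quantile estimate.
from statistics import NormalDist
from math import sqrt

class RunningQuantileEstimate:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def quantile(self, q):
        var = self.m2 / (self.n - 1)  # sample variance; needs at least two points
        return NormalDist(self.mean, sqrt(var)).inv_cdf(q)

est = RunningQuantileEstimate()
for x in [1.2, 0.7, 2.3, 1.9, 0.4, 1.1]:
    est.add(x)
print(est.quantile(0.9))  # approximate 0.9-quantile under the normality assumption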

Unless you know that your underlying data comes from some distribution, it is not possible to update arbitrary quantiles without retaining the original data. You can, as others suggested, assume that the data has some sort of distribution and store the quantiles this way, but this is a rather restrictive approach.
Alternately, have you thought of programming this somewhere besides Mathematica? For example, you could create a class for your datapoints that contains (1) the double value and (2) a timestamp for when the data came in. With these datapoint objects kept in a SortedList (which compares based on value), you could get the quantile very fast by simply indexing into the list. Want a historical quantile? Simply filter on the timestamps in your sorted list.
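
For comparison, here is a rough Python sketch of that sorted-structure idea: a list kept sorted with bisect stands in for the SortedList, so reading a quantile is an index lookup, while each insert is O(n) because of the shift (a balanced tree or skip list would bring inserts down to O(log n)). Class and field names are illustrative.

# Rough sketch: (value, timestamp) pairs kept sorted by value; quantile by index.
import bisect
from datetime import datetime
from math import ceil

class QuantileStore:
    def __init__(self):
        self.points = []  # sorted by value: (value, timestamp)

    def add(self, value):
        bisect.insort(self.points, (value, datetime.utcnow()))

    def quantile(self, q):
        # Nearest-rank (ceiling) definition, similar to Mathematica's default Quantile.
        idx = max(0, ceil(q * len(self.points)) - 1)
        return self.points[idx][0]

store = QuantileStore()
for v in [3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.6]:
    store.add(v)
print(store.quantile(0.5))  # 3.0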


Data Structure Question: Is there a link between the size of a list in a chaining implementation of hash maps and its load factor?

I'm trying to learn data structures. For example, if I have n keys and m slots in the hash map, the average size of a linked list starting from a slot would be n/m. Am I correct in thinking this? Again, I'm talking about an average. Thanks in advance!
As you say, the average size of a single list is generally going to be the table's load factor; but this assumes that the Simple Uniform Hashing Assumption (SUHA) holds for your hash table (more specifically, for its hash function(s) and expected input keys): simply put, we assume that the hash function distributes elements to buckets uniformly, as well as independently of one another.
To expand a little, and in different words:
We assume that if we choose a new item randomly (imagine sampling an item from the probability distribution that characterizes our inputs), then there is an equal chance that the item we end up with will be mapped to any of the m buckets. (A chance of 1/m.)
Furthermore, that this probability is unaffected given the presence (or absence) of any other elements in any of the buckets.
This is helpful because from it we can conclude that the probability for an item to land in a given bucket is always 1/m, regardless of any other circumstances; it directly follows that the expected (average) length of a single bucket's list will be n/m: we insert n elements into the table, and each one ends up in this particular list with probability 1/m, so the expected count is n * (1/m) = n/m.
To see that this is important, we might imagine a case in which it doesn't hold: for instance, if we're facing some kind of "attack" and our inputs are engineered to all hash into the same bucket, or even just with a high probability. In this case SUHA no longer holds, and clearly neither does the link you've asked about between the length of a list and the load factor.
This is part of the reason it is important to choose a good hash function for your use case: without one, the assumption may not hold, which could have a harmful effect on your lookup times.
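
To make the n/m figure concrete, here is a quick Python simulation sketch: hash n random string keys into m buckets and compare the average chain length with the load factor (the key generation and table size are arbitrary choices for illustration).

# Quick simulation: average chain length of a chained hash table vs. load factor n/m.
import random
import string
from collections import defaultdict

n, m = 100_000, 1_024
buckets = defaultdict(list)
for _ in range(n):
    key = ''.join(random.choices(string.ascii_letters, k=12))
    buckets[hash(key) % m].append(key)

avg_chain = sum(len(chain) for chain in buckets.values()) / m
print(avg_chain, n / m)  # both close to ~97.7 when the hash spreads keys uniformly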

Missing values for nominal attribute in Weka

I have a data set and I am doing classification using the Weka NaiveBayes classifier. I have 14 attributes, some of which are nominal.
In only one of these attributes, I have some missing values. What I have done so far is that I have left them as missing values, and I know that Weka replaces those values automatically (a question about that has been asked here).
I mean, the values for this attribute are empty in my feature file, and when I create the ARFF file, I see "?" between the two commas.
Now, I have two possibilities:
1) Let them be filled by Weka automatically.
2) Replace them by "NULL".
The problem is that in the first case the classifier works better. Now I am wondering: is it OK to let Weka replace them, or should I use the second approach even though I get worse results?
I mean, when should we let Weka replace the missing values, and when not?
Meanwhile, the feature that has missing values represents the WordNet supersense of the word, and when it is empty it means that the instance is, for example, a preposition or a WH question.
Thanks in advance,
Well, about missing values: Weka doesn't replace them by default, you have to use a filter (exactly as in the post you linked first in your question). Some classifiers can handle missing values themselves; I think Naive Bayes can, simply by not counting them in the probability calculation. So basically you have three options:
1) Use the ReplaceMissingValues filter to replace missing values with the mode (or mean, for numeric attributes).
2) Don't use a filter and work with the dataset with missing values (in this case I recommend you have a look at how Naive Bayes works, to understand how your missing values will be treated and whether that is good for you).
3) Replace your missing values with your own label, like "other values" or similar.
The key to the correct choice is probably in your last paragraph, which suggests that your missing values actually mean something. If that is so, I would use the third approach: your new label (see the sketch below). On the other hand, if the missing values don't mean anything and are just the result of some fault in data collection, I would consider the first two approaches. Good luck.
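
As a small illustration of the third option, here is a hedged Python sketch that replaces the empty supersense values with an explicit label before the ARFF is generated; the CSV layout, column name and label are assumptions for illustration, not taken from your setup.

# Sketch: fill the empty supersense column with an explicit label before building the ARFF.
# The file name, column name and label below are hypothetical.
import csv

with open("features.csv", newline="") as src, open("features_filled.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if not row["supersense"].strip():
            row["supersense"] = "NO_SUPERSENSE"  # e.g. prepositions, WH words
        writer.writerow(row)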

Hashmap Inserts O(N)?

So, let's say you have a hashmap that uses linear probing.
You first insert a value X with key X, which hashes to location 5, say.
You then insert a value Y with key Y, which also hashes to 5. It will take location 6.
You then insert a value Z with key Z, which also hashes to 5. It will take location 7.
You then delete Y, so the memory looks like "X, null, Z"
You then try to insert a value with key Z: it will check 5, see it's taken, check 6, and insert there since it's empty. However, there is already an entry with key Z, so you'd have two entries with key Z, which violates the invariant.
So wouldn't you therefore need to go through the entire map until you find the key itself? If it's not found, then you can insert it into the first null space. Wouldn't all first-time inserts on a given key therefore be O(N)?
No.
The problem you're running into is caused by the deletion, which you've done incorrectly.
In fact, deletion from a table using linear probing is somewhat difficult -- to the point that many tables built using linear probing simply don't support deletion at all.
That said: at least in theory, nearly all operations on a hash table can end up linear in the worst case (insertion, deletion, lookup, etc.). Regardless of how clever a hash function you write, there are infinitely many inputs that can hash to any particular output. With a sufficiently unfortunate choice of inputs (or just a poor hash function) you can end up with an arbitrarily large fraction of them all producing the same hash code.
Edit: if you insist on supporting deletion with linear probing, the basic idea is that you need to ensure that each "chain" of entries remains contiguous. So, you hash the key, then walk from there all the way to the next empty bucket. You check the hash code for each of those entries, and fill the "hole" with the last contiguous item that hashed to a position before the hole. That, in turn, may create another hole that you have to fill in with the last item that hashed to a position before that hole you're creating (and so on, recursively).
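
To make that concrete, here is a small Python sketch of a linear-probing table whose delete keeps every probe chain intact. Instead of shifting entries into the hole in place, it re-inserts the rest of the cluster, which preserves the same invariant and is easier to get right; it is a toy (fixed capacity, no resizing), not a production hash table.

# Toy linear-probing map with chain-preserving deletion (re-insert the rest of the cluster).
class LinearProbingMap:
    def __init__(self, capacity=8):
        self.cap = capacity
        self.slots = [None] * capacity  # each slot is None or a (key, value) pair

    def put(self, key, value):
        i = hash(key) % self.cap
        while self.slots[i] is not None:
            if self.slots[i][0] == key:       # key already present: update in place,
                self.slots[i] = (key, value)  # so duplicates can never appear
                return
            i = (i + 1) % self.cap
        self.slots[i] = (key, value)          # assumes the table is never full

    def delete(self, key):
        i = hash(key) % self.cap
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.cap
        if self.slots[i] is None:
            return                            # key not present
        self.slots[i] = None
        j = (i + 1) % self.cap                # re-insert the rest of the cluster so
        while self.slots[j] is not None:      # no probe sequence is broken by the hole
            k, v = self.slots[j]
            self.slots[j] = None
            self.put(k, v)
            j = (j + 1) % self.cap

m = LinearProbingMap()
for k in ["X", "Y", "Z"]:
    m.put(k, k)
m.delete("Y")
m.put("Z", "Z2")  # finds and updates the existing Z entry; no duplicate is created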
Not sure why the village idiot (;)) deleted his post, since he was right -- an overcommitted/unbalanced hash table degenerates into a linear search.
To achieve O(1) performance the table must not be overcommitted (the table must be sufficiently oversized, given the number of entries), and the hash algorithm must do a good job (avoiding imbalance), given the characteristics/statistics of the key value.
It should be noted that there are two basic hash table schemes -- linear probing, where hash synonyms are simply inserted into the next available table slot, and linked lists, where hash synonyms are added to a linked list off the table element for the given hash value. They produce roughly the same statistics until overcommitted/unbalanced, at which point linear probing quickly falls completely apart while linked lists simply degrade slowly. And, as someone else stated, linear probing makes deletions very difficult.

Clustering string data with ELKI

I need to cluster a large number of strings using ELKI based on the Edit Distance / Levenshtein Distance. Since the data set is too large, I'd like to avoid file based precomputed distance matrices. How can I
(a) load string data in ELKI from a file (only "Labels")?
(b) implement a distance function accessing the labels (extend AbstractDBIDDistanceFunction, but how to get the labels?)
Some code snippets or example input files would be helpful.
It's actually pretty straightforward:
A) write a Parser that is adequate for your input file format (why try to reuse a parser written for numerical vectors with labels?), probably subclassing AbstractStreamingParser, producing a relation of the desired data type. Probably you can just use String; if you want to be a bit more general, a TokenSequence may be a more appropriate concept for these distances, since strings are just the simplest case.
B) implement a DistanceFunction based on this vector type instead of DBIDs, i.e. a PrimitiveDistanceFunction<String>. Again, subclassing AbstractPrimitiveDistanceFunction may be the easiest thing to do.
For performance reasons, you may also want to look into indexing algorithms to retrieve e.g. the k most similar strings efficiently. I'm not sure which index structures exist for string edit distance and Levenshtein distance.
A colleague has a student that apparently has some working token edit distances, but I have not seen or reviewed the code yet. As he is processing log files, he will probably be using a token based approach instead of characters.
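
For reference, the metric itself is simple to write down. This plain Python sketch only illustrates Levenshtein distance and says nothing about the ELKI Parser/PrimitiveDistanceFunction plumbing, which is Java.

# Classic dynamic-programming Levenshtein distance, keeping one row at a time.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3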

Updating TimeUUID columns in cassandra

I'm trying to store some time series data on the following column family:
create column family t_data with comparator=TimeUUIDType and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
I'm successfully inserting data this way:
data={datetime.datetime(2013, 3, 4, 17, 8, 57, 919671):'VALUE'}
key='row_id'
col_fam.insert(key,data)
As you can see, when a datetime object is used as the column name, pycassa converts it to a TimeUUID correctly.
[default#keyspace] get t_data[row_id];
=> (column=f36ad7be-84ed-11e2-af42-ef3ff4aa7c40, value=VALUE, timestamp=1362423749228331)
Sometimes, the application needs to update some data. The problem is that when I try to update that column, passing the same datetime object, pycassa creates a different UUID object (the time part is the same) so instead of updating the column, it creates another one.
[default#keyspace] get t_data[row_id];
=> (column=f36ad7be-84ed-11e2-af42-ef3ff4aa7c40, value=VALUE, timestamp=1362423749228331)
=> (column=**f36ad7be**-84ed-11e2-b2fa-a6d3e28fea13, value=VALUE, timestamp=1362424025433209)
The question is, how can I update TimeUUID based columns with pycassa passing the datetime object? or, if this is not the correct way to doing it, what is the recommended way?
Unless you do a read-modify-write you can't. UUIDs are by their nature unique. They exist to solve the problem of how to get unique IDs that sort in chronological order but at the same time avoid collisions for things that happen at exactly the same time.
So to update that column you need to first read it, so you can find its column key, change its value and write it back again.
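
A hedged pycassa sketch of that read-modify-write: convert the datetime into the lowest and highest possible TimeUUIDs for that timestamp, slice the row to find the existing column name, then write the new value back under that exact name (col_fam and 'row_id' are from the question; everything else is illustrative).

# Sketch: find the existing TimeUUID column for a timestamp, then overwrite it.
import datetime
from pycassa.util import convert_time_to_uuid

ts = datetime.datetime(2013, 3, 4, 17, 8, 57, 919671)
start = convert_time_to_uuid(ts, lowest_val=True)    # smallest TimeUUID for that instant
finish = convert_time_to_uuid(ts, lowest_val=False)  # largest TimeUUID for that instant

# Read: slice the row to find the column(s) whose time component matches ts.
existing = col_fam.get('row_id', column_start=start, column_finish=finish)

# Modify and write back under the *same* TimeUUID column name.
for column_name in existing:
    col_fam.insert('row_id', {column_name: 'NEW_VALUE'})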
It's not a particularly elegant solution. You should really avoid read-modify-write in Cassandra. Perhaps TimeUUID isn't the right type for your column keys? Or perhaps there's another way you can design your application to avoid having to go back and change things.
Without knowing what your query patterns look like I can't say exactly what you should do instead, but here are some suggestions that hopefully are relevant:
Don't update values, just write new values. If something was true at time T, it will always have been true for time T, even if it changes at time T + 1. When things change you write a new value with the time of the change and let the old values be. When you read the timeline you resolve these conflicts by picking the most recent value, and since the values are sorted in chronological order the most recent value will always be the last one. This is very similar to how Cassandra does things internally, and it's a very powerful pattern.
Don't worry that this will use up more disk space, or require some extra CPU when reading the time series, it will most likely be tiny in comparison with the read-modify-write complexity that you would otherwise have to implement.
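
A minimal pycassa sketch of that append-only pattern, assuming the same column family as in the question: every change is written under a fresh TimeUUID, and the current value is whatever a reversed slice of length one returns.

# Sketch: append-only writes, read the most recent column instead of updating in place.
import datetime

# Write: a new TimeUUID column per change (col_fam is the ColumnFamily from the question).
col_fam.insert('row_id', {datetime.datetime.utcnow(): 'NEW_VALUE'})

# Read: the last column in TimeUUID order is the most recent value.
latest = col_fam.get('row_id', column_count=1, column_reversed=True)
(latest_uuid, latest_value), = latest.items()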
There might be other ways to solve your problem, and if you give us some more details maybe we can come up with something that fits better.
