approximate histogram for streaming string values (card catalog algorithm?) - string

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart, I'm looking for something more like an old fashioned library card catalog where you have a set of drawers (bins), and one might hold SAM - SOLD,and the next bin SOLE-STE, while all of Y-ZZZ fits in a single bin. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to approximate histograms for numeric values? or (B) suggestions on how to encode the strings in a way that a standard numeric histogram algorithm would work. The algorithm should not require prior knowledge of string population.
The best way I can think to do it so far is to simply wait until I have some reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.

If we can't make any assumptions about the data, you are going to have to make a pass to determine bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact size, but you might know that the value will fall in some interval [a, b]. If you want at most n bins, make the bin size == a/n.
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m elements on your pass and dump it into an array, where m is something reasonable based on context.
Then, to find the bin endpoints, you'd find the element at size/n/m in your array.

The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size, from a population of an unknown size. See Wikipedia for more details. Reservoir sampling provides a random sample regardless of whether the stream is ordered or not.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an even number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall within two values: If there are 100 equal sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.

Related

Decision trees: information gain - bias against attributes - how and why is it say so?

I am confused to get the context on biases in the following line (marked in bold):
Information gain ratio biases the decision tree against considering attributes with a large number of distinct values which might lead to overfitting.
Did you mean Information gain, as information gain is bias towards variables with large distinct values and information gain ratio is tries to solve this by taking into account the number of branches that would result before making the split, It corrects information gain by taking the intrinsic information of a split into account.
Answer for why information gain is biased towards variables with large distinct values
Please note that information gain (IG) is biased toward variables with large number of distinct values not variables that have observations with large values. Before describing the reason of this condition, lets review the definition of IG.
Information gain is the amount of information that's gained by knowing the value of the attribute, which is the entropy of the distribution before the split minus the entropy of the distribution after it. The largest information gain is equivalent to the smallest entropy.
In other words, a variable with the highest number of distinct values probability can divide data to smaller chunks. Also, we know that lower number of observations in each chunk reduces probability of variation occurrence.
Using ID variable in splitting data is a common example for this issue. Since each individual sample has their own distinct value, selecting ID features leads to many clusters with one sample and entropy of zero. Therefore, a decision tree that works with IG, selects the ID as the first separator attribute. Indeed, entropy will approach to zero by selecting the ID feature. However, we are not interested to such a feature. We are more interested to features that highly explain the variation of dependent variable.
Please refer to this discussion where this point was initially written.

Excel - How to split data into train and test sets that are equally distributed

I've got a data set (in Excel) that I'm going to import into SAS to undertake some modelling.
I've got a method for randomly splitting my excel dataset (using the =RAND() function), but is there a way (at the splitting stage) to ensure the distribution of the samples is even (other than to keep randomly splitting and testing the distribution until it becomes acceptable)?
Otherwise, if this is best performed in SAS, what is the most efficient approach for testing the sample randomness?
The dataset contains 35 variables, with a mixture of binary, continuous and categorical variables.
In SAS, you can just use proc surveyselect to do this.
proc surveyselect data=sashelp.cars out=cars_out outall samprate=0.7;
run;
data train test;
set cars_out;
if selected then output test;
else output train;
run;
If there is a particular variable[s] you want to make sure the Train and Test sets are balanced on, you can use either strata or control depending on exactly what sort of thing you're talking about. control will simply make an approximate attempt to even things by the control variables (it sorts by the control variable, then pulls every 3rd or whatever, so you get a sort of approximate balance; if you have 2+ control variables it snake-sorts, Asc. then Desc. etc. inside, but that reduces randomness).
If you use strata, it guarantees you the sample rate inside the strata - so if you did:
proc sort data=sashelp.cars out=cars;
by origin;
run;
proc surveyselect data=cars out=cars_out outall samprate=0.7;
strata origin;
run;
(and the final splitting data step is the same) then you'd get 70% of each separate origin pulled (which would end up being 70% of the total, of course).
Which you do depends on what you care about it being balanced by. The more things you do this with, the less balanced it is with everything else, so be cautious; it may be that a simple random sample is the best, especially if you have a good enough N.
If you don't have enough N, then you can use bootstrapping techniques, meaning you take a sample WITH replacement from that 70% and take maybe 100 of those samples, each with a higher N than your original. Then you do your test or whatever on each sample selected, and the variation in those results tells you how you're doing even if your N is not enough to do it in one pass.
This answer has nothing to do with Excel, but with sampling strategy.
First we must construct a criteria that the sample's measure's are "close enough" to the complete dataset.
Say we are interested in the mean and the standard deviation and that the complete population is a set of 10,000 values in column A
we calculate the mean and standard deviation of the complete dataset.
devise a "close enough" criteria for each measure
pick, say, 500 samples
calculate the measures for the sample.
if the measures are "close enough" we are done, otherwise pick another 500.
We need to be careful that the criteria are not too tight; otherwise we may loop forever.

Data mining for significant variables (numerical): Where to start?

I have a trading strategy on the foreign exchange market that I am attempting to improve upon.
I have a huge table (100k+ rows) that represent every possible trade in the market, the type of trade (buy or sell), the profit/loss after that trade closed, and 10 or so additional variables that represent various market measurements at the time of trade-opening.
I am trying to find out if any of these 10 variables are significantly related to the profits/losses.
For example, imagine that variable X ranges from 50 to -50.
The average value of X for a buy order is 25, and for a sell order is -25.
If most profitable buy orders have a value of X > 25, and most profitable sell orders have a value of X < -25 then I would consider the relationship of X-to-profit as significant.
I would like a good starting point for this. I have installed RapidMiner 5 in case someone can give me a specific recommendation for that.
A Decision Tree is perhaps the best place to begin.
The tree itself is a visual summary of feature importance ranking (or significant variables as phrased in the OP).
gives you a visual representation of the entire
classification/regression analysis (in the form of a binary tree),
which distinguishes it from any other analytical/statistical
technique that i am aware of;
decision tree algorithms require very little pre-processing on your data, no normalization, no rescaling, no conversion of discrete variables into integers (eg, Male/Female => 0/1); they can accept both categorical (discrete) and continuous variables, and many implementations can handle incomplete data (values missing from some of the rows in your data matrix); and
again, the tree itself is a visual summary of feature importance ranking
(ie, significant variables)--the most significant variable is the
root node, and is more significant than the two child nodes, which in
turn are more significant than their four combined children. "significance" here means the percent of variance explained (with respect to some response variable, aka 'target variable' or the thing
you are trying to predict). One proviso: from a visual inspection of
a decision tree you cannot distinguish variable significance from
among nodes of the same rank.
If you haven't used them before, here's how Decision Trees work: the algorithm will go through every variable (column) in your data and every value for each variable and split your data into two sub-sets based on each of those values. Which of these splits is actually chosen by the algorithm--i.e., what is the splitting criterion? The particular variable/value combination that "purifies" the data the most (i.e., maximizes the information gain) is chosen to split the data (that variable/value combination is usually indicated as the node's label). This simple heuristic is just performed recursively until the remaining data sub-sets are pure or further splitting doesn't increase the information gain.
What does this tell you about the "importance" of the variables in your data set? Well importance is indicated by proximity to the root node--i.e., hierarchical level or rank.
One suggestion: decision trees handle both categorical and discrete data usually without problem; however, in my experience, decision tree algorithms always perform better if the response variable (the variable you are trying to predict using all other variables) is discrete/categorical rather than continuous. It looks like yours is probably continuous, in which case in would consider discretizing it (unless doing so just causes the entire analysis to be meaningless). To do this, just bin your response variable values using parameters (bin size, bin number, and bin edges) meaningful w/r/t your problem domain--e.g., if your r/v is comprised of 'continuous values' from 1 to 100, you might sensibly bin them into 5 bins, 0-20, 21-40, 41-60, and so on.
For instance, from your Question, suppose one variable in your data is X and it has 5 values (10, 20, 25, 50, 100); suppose also that splitting your data on this variable with the third value (25) results in two nearly pure subsets--one low-value and one high-value. As long as this purity were higher than for the sub-sets obtained from splitting on the other values, the data would be split on that variable/value pair.
RapidMiner does indeed have a decision tree implementation, and it seems there are quite a few tutorials available on the Web (e.g., from YouTube, here and here). (Note, I have not used the decision tree module in R/M, nor have i used RapidMiner at all.)
The other set of techniques i would consider is usually grouped under the rubric Dimension Reduction. Feature Extraction and Feature Selection are two perhaps the most common terms after D/R. The most widely used is PCA, or principal-component analysis, which is based on an eigen-vector decomposition of the covariance matrix (derived from to your data matrix).
One direct result from this eigen-vector decomp is the fraction of variability in the data accounted for by each eigenvector. Just from this result, you can determine how many dimensions are required to explain, e.g., 95% of the variability in your data
If RapidMiner has PCA or another functionally similar dimension reduction technique, it's not obvious where to find it. I do know that RapidMiner has an R Extension, which of course let's you access R inside RapidMiner.R has plenty of PCA libraries (Packages). The ones i mention below are all available on CRAN, which means any of the PCA Packages there satisfy the minimum Package requirements for documentation and vignettes (code examples). I can recommend pcaPP (Robust PCA by Projection Pursuit).
In addition, i can recommend two excellent step-by-step tutorials on PCA. The first is from the NIST Engineering Statistics Handbook. The second is a tutorial for Independent Component Analysis (ICA) rather than PCA, but i mentioned it here because it's an excellent tutorial and the two techniques are used for the similar purposes.

Determining Note Durations based on Onset Locations

I have a question regarding how to determine the Duration of notes given their Onset Locations.
So for example, I have an array of amplitude values (containing short) and another array of the same size, that contains a 1 if a note onset is detected, and a 0 if not. So basically, the distance between each 1 will be used to determine the duration.
How can I do this? I know that I have to use the Sample Rate and other attributes of the audio data, but is there a particular formula that I can use?
Thank you!
So you are starting with a list of ONSETS, what you are really looking for is a list of OFFSETS.
There are many methods for onset detection (here is a paper on it) https://adamhess.github.io/Onset_Detection_Nov302011.pdf
many of the same methods can be applied to Offset Detection:
Since the onset is marked by an INCREASE in spectral content you can measure a decrease in Spectral content.
take a reasonable time window before and after your onset. (.25-.5s)
Chop up the window into smaller segments and take 50% overlapping Fourier transforms.
compute the difference between the fourier co-efficient between two successive windows decreases and only allow negative changes in SD.
multiple your results by -1.
pick the peaks off of the results
Voila, offsets.
(look at page 7 of the paper listed above for more detail about spectrial difference function, you can apply a modified (as above) version of it_
Well, if your samplerate in Hz is fs, then the time between two nodes is equal to
1/fs * <number of zeros between the two node-ones>
Very simple :-)
Regards

Fast(er) way of matching feature to database

I'm working on a project where I have a feature in an image described as a set of X & Y coordinates (5-10 points per feature) which are unique for this feature. I also have a database with thousands of features where each have the same type of descriptor. The result looks like this:
myFeature: (x1,y1), (x2,y2), (x3,y3)...
myDatabase: Feature1: (x1,y1), (x2,y2), (x3,y3)...
Feature2: (x1,y1), (x2,y2), (x3,y3)...
Feature3: (x1,y1), (x2,y2), (x3,y3)...
...
I want to find the best match of myFeature in the features in myDatabase.
What is the fastest way to match these features? Currently I am stepping though each feature in the database and comparing each individual point:
bestScore = 0
for each feature in myDatabase:
score = 0
for each point descriptor in MyFeature:
find minimum distance from the current point to the...
points describing the current feature in the database
if the distance < threshold:
there is a match to the current point in the target feature
score += 1
if score > bestScore:
save feature as new best match
This search works, but clearly it gets painfully slow on large databases. Does anyone know of a faster method to do this type of search, or at least if there is a way to quickly rule out features that clearly won't match the descriptor?
Create a bitset (an array of 1s and 0s) from each feature.
Create such a bitmask for your search criteria and then just use a bitwise and to compare the search mask to your features.
With this approach, you can shift most work to the routines responsible for saving the stuff. Also, creating the bitmasks should not be that computationally intensive.
If you just want to rule out features that absolutely can't match, then your mask-creation algorithm should take care of that and create the bitmasks a bit fuzzy.
The easiest way to create such masks is probably by creating a matrix as big as the matrix of your features and put a one in every coordinate that is set for the feature and a zero in every coordinate that isn't. Then turn that matrix into a one dimensional row. Compare the feature-row then to the search mask bitwise.
This is similar to the way bitmap indexes work on large databases (oracle e.g.), but with a different intention and without a full bitmap-image of all database rows in memory.
The power of this is in the bitwise comparisons.
On a 32bit machine you can perform 32 comparisons per instruction when you can just do one with integer numbers in a point comparison. It yields even higher boni for floating point operations, depending on the architecture.
This in general looks like a spatial index problem. It's not my field, but you'll probably need to build a sort of tree index, such as a quadtree, that you can use to easily search for features. You can find some links from this wikipedia article: http://en.wikipedia.org/wiki/Spatial_index
It might be a problem that you can easily implement in an existing spatial database. It's very GIS-like in its description.
One thing you can do is calculate a point of gravity for every feature and use that to whittle down the search space a bit (a one dimensional search is a lot easier to build an index for), but that has the downside of being just a heuristic (depending on the shapes of your feature, the point of gravity may end up in weird places).

Resources