Fast token overlap between strings

I have two sets of tokenised sentences, A and B, and I want to calculate the overlap between them in terms of common tokens. For example, the overlap between the two individual sentences a1 "today is a good day" and b1 "today I went to a park" is 2 (today and a). I need a simple string matching method, without fuzzy or more advanced matching. The result should be a matrix of overlap counts between all sentences in A and B.
The problem is that, while trivial, this is a quadratic operation (size of A x size of B pair-wise comparisons). With large data, the computation gets very slow very quickly. What would be a smart way of computing this while avoiding pair-wise comparisons, or doing them very fast? Are there packages or data structures particularly well suited for this?
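One common way to avoid an explicit Python-level pair-wise loop is to encode each side as a binary sentence-by-token matrix and let a single sparse matrix product produce the whole overlap-count matrix. A minimal sketch, assuming scipy is available (the sentences are the example from the question):

    import numpy as np
    from scipy.sparse import csr_matrix

    def binary_matrix(sentences, vocab):
        # One row per sentence, one column per token; 1 if the token occurs.
        rows, cols = [], []
        for i, tokens in enumerate(sentences):
            for t in set(tokens):
                rows.append(i)
                cols.append(vocab[t])
        data = np.ones(len(rows), dtype=np.int32)
        return csr_matrix((data, (rows, cols)), shape=(len(sentences), len(vocab)))

    A = [["today", "is", "a", "good", "day"]]
    B = [["today", "I", "went", "to", "a", "park"]]
    vocab = {t: j for j, t in enumerate({t for s in A + B for t in s})}

    # overlap[i, j] = number of distinct tokens shared by A[i] and B[j]
    overlap = binary_matrix(A, vocab) @ binary_matrix(B, vocab).T
    print(overlap.toarray())  # [[2]]

Because the matrices are sparse, only sentence pairs that actually share at least one token contribute work to the product, which is usually far fewer than size of A x size of B.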

Related

How can I obtain hourly readings from 24 hour moving average data?

I have an Excel dataset of 24-hour moving averages for PM10 air pollution concentration levels, and I need to obtain the individual hourly readings from them. The moving average is updated every hour, so at hour t the reading is the average of the 24 readings from t-23 to t, and at hour t+1 it is the average of the readings from t-22 to t+1, and so on. I do not have any known data points to extrapolate from, just the 24-hour moving averages.
Is there any way I can obtain the individual hourly readings for time t, t+1, etc, from the moving average?
The dataset contains data over 3 years, so with 24 readings a day (at every hour), the dataset has thousands of readings.
I have tried searching for a way to implement a simple Excel VBA solution for this, but have come up empty. Most of the posts I have seen on Stack Overflow, Stack Exchange, and other forums involve calculating moving averages from discrete data, which is the reverse of what I need to do here.
The few that go the other way involve using matrices, which I am not sure how to implement.
(https://stats.stackexchange.com/questions/67907/extract-data-points-from-moving-average)
(https://stats.stackexchange.com/questions/112502/estimating-original-series-from-their-moving-average)
Any suggestions would be greatly appreciated!
Short answer: you can't.
Consider a moving average over 3 points, and suppose we multiply each MA term by 3, so that we are really working with sums of consecutive values:
Data: a  b  c  d  e  f  g
MA:   a+b+c,  b+c+d,  c+d+e,  d+e+f,  e+f+g
With initial values, you can do something. To find the value of d, you would need to know b+c, hence to know a (since a+b+c is known). Then to find e, you know c+d+e and d, so you must find c, and since a is already needed, you will also need b.
More generally, for an MA of length n, if you know the first n-1 values (hence also the n-th, since you know the sum), then you can find all subsequent values. You can also start from the end. But basically, if you don't have enough original data, you are lost: for a given MA series, the original series consistent with it are in one-to-one correspondence with the possible choices of its first n-1 values. If you don't have that information, there are infinitely many possibilities, and you can't decide which one is right.
Here I consider the simplest MA where the coefficient of each variable is 1/n (hence you compute the sum and divide by n). But this would apply to any MA, with slightly more complexity to account for different coefficients for each term in the sum.
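To make the "you need the first n-1 values" point concrete, here is a small Python sketch for the 3-point, equal-weight case above (all numbers are made up); it simply unrolls x_t = n*MA_t - x_{t-1} - x_{t-2}, and without the two starting values the loop could not even begin:

    # Recover a series from its 3-point moving averages, assuming the first
    # n-1 = 2 original values are known.
    n = 3
    x_true = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]                    # hidden "truth"
    ma = [sum(x_true[i:i + n]) / n for i in range(len(x_true) - n + 1)]

    x = x_true[:n - 1]                    # the two known initial values
    for m in ma:
        x.append(n * m - x[-1] - x[-2])   # x_t = n*MA_t - x_{t-1} - x_{t-2}
    print(x)                              # reproduces x_true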

Find the 10 smallest numbers in a large dataset?

I am coding specifically in Python, but right now I am in the phase of designing pseudocode for an algorithm that will take a data set with n values, n being very large, and pick out the 10 smallest values (or, more generally, the m smallest for some finite m << n). I would like an optimally efficient algorithm for these requirements.
My idea:
1) Heapsort the data, then pick the smallest 10 values. O(n log n).
2) Alternatively, use a loop that scans the data to identify a 'champion' (the current minimum), remove it from the dataset, and repeat 10 times. O(n), given that m is small.
Which suggestion would be best, or is there a better alternative?
One approach among many possible:
Grab 10 values and sort them. Now compare the largest of them with the 11th through n-th values, one at a time. Whenever the new value is smaller, replace the largest of your 10 with it and re-sort your 10 values.
The list of 10 values, and the work of sorting it, will stay in cache, so that part is fast even with rough code. The whole dataset is read through only once, so that is fast as well.
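A sketch of that idea in Python, using a size-10 max-heap in place of the re-sort (heapq is in the standard library; the data below is hypothetical):

    import heapq
    import random

    def m_smallest(data, m=10):
        # Keep a max-heap (values negated) of the m smallest seen so far.
        heap = []                        # -heap[0] is the largest of the current m smallest
        for x in data:
            if len(heap) < m:
                heapq.heappush(heap, -x)
            elif x < -heap[0]:           # smaller than the current m-th smallest
                heapq.heapreplace(heap, -x)
        return sorted(-v for v in heap)

    data = [random.random() for _ in range(1_000_000)]   # hypothetical dataset
    print(m_smallest(data))
    # The standard library also provides this directly: heapq.nsmallest(10, data)

This is O(n log m) overall, which for fixed m = 10 is effectively the single pass described above.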

Comparing count data with lots of zeroes

I'm not one to search for the most tenuous significant difference I can find, but hear me out.
I have some count data with four groups, A, B, C, and X (the first three can be combined into one, if necessary).
Looking at the means and interval plots, X is clearly greater than the others (in terms of mean value), yet I cannot find any statistical test to back this up. This is, I believe, partly due to the high variability within groups and the large number of zero values.
I have tried normalizing, removing zeroes, parametric and non-parametric tests, and more, with no success!
Any advice would be greatly appreciated as to how to approach this.
Many thanks.
The link below has the raw data. Groups A, B, and C can be combined into one group if it is relevant.
https://drive.google.com/open?id=0B6iQ6-J6e2TeU25Rd2hsd0Uxd2c

Can you estimate percentiles in unordered data?

Suppose you have a very large list of numbers which would be expensive to sort. They are real numbers/decimals, but all lie in the same range, say 0 to n for some integer n. Are there any methods for estimating percentiles that don't require sorting the data, i.e. an algorithm with better complexity than the fastest sorting algorithm?
Note: The tag is quantiles only because there is no existing tag for percentiles and it wouldn't let me create one; my question is not specific to quantiles.
In order to find the p-th percentile of a set of N numbers, essentially you are trying to find the k-th smallest number, where k = N*p/100 rounded to an integer (the exact rounding or interpolation convention varies).
You might try the median-of-medians selection algorithm, which can find the k-th smallest number among N numbers in O(N) time.
I don't know whether this is implemented in a standard library, but a proposed implementation was posted in one of the answers to this question.
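If NumPy is acceptable, np.partition does the selection step without a full sort (it uses an introselect-style algorithm rather than strict median-of-medians, but the idea is the same: place the k-th value in roughly O(N)). A sketch, with the index convention from above and hypothetical data:

    import numpy as np

    def percentile_by_selection(values, p):
        # k-th order statistic via partial selection, no full sort.
        a = np.asarray(values, dtype=float)
        k = int(len(a) * p / 100)              # rounded down, as in the answer above
        k = min(max(k, 0), len(a) - 1)
        return np.partition(a, k)[k]

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 50, size=1_000_000)     # hypothetical unsorted data in [0, n]
    print(percentile_by_selection(x, 90))      # close to np.percentile(x, 90)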

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in Excel. Each column represents an algorithm, and the values in the rows are the results of these algorithms with different parameters. I want to run a statistical significance test on these two algorithms in Excel. Can anyone suggest a function?
As a result, it would be nice to be able to state something like "Algorithm A performs 8% better than Algorithm B with .9 probability (or a 95% confidence interval)".
The wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task, but I have failed to find a suitable built-in measurement function.
Any advice on a built-in Excel function, or function snippets, would be appreciated.
Thanks.
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are simply real numbers between 1 and 100 (they are percentage values). Each row represents a different parameter, and the values in a row represent each algorithm's result for that parameter. The results do not depend on each other.
When I take the average of all values for Algorithm A and Algorithm B, I see that the mean of all results that Algorithm A produced is 10% higher than Algorithm B's. But I don't know whether this is statistically significant or not. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B, while for the rest Algorithm B has higher scores, and just because of that one result the difference in the averages is 10%.
And I want to do this calculation using just excel.
Thanks for the clarification. In that case you want to do an independent-samples t-test, meaning you want to compare the means of two independent data sets.
Excel has a function, TTEST(array1, array2, tails, type), which is what you need.
For your example you should probably use two tails and type 2.
The formula outputs a probability value, the probability of an alpha (Type I) error. This is the error you would make by concluding the two datasets are different when they actually are not. The lower this probability, the stronger the evidence that your sets are different.
You should only accept that the two datasets differ if the value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs at least around 30 values per dataset to be reliable, and that the type 2 test assumes the two datasets have equal variances. If equal variances cannot be assumed, use the type 3 test instead.
http://depts.alverno.edu/nsmt/stats.htm
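If you ever want to sanity-check the Excel result outside of Excel, the same test is available in Python via scipy; a sketch with made-up numbers (equal_var=True corresponds to Excel's type 2, equal_var=False to type 3):

    from scipy import stats

    algo_a = [72.1, 68.5, 90.2, 55.0, 61.3]   # made-up percentage scores, one per parameter
    algo_b = [60.4, 70.2, 75.8, 50.1, 58.9]

    # Two-sided, equal-variance t-test: the analogue of TTEST(range_a, range_b, 2, 2)
    t_stat, p_value = stats.ttest_ind(algo_a, algo_b, equal_var=True)
    print(t_stat, p_value)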