Median of Medians - median

I have read about the order-statistics algorithm that finds the k-th smallest (or largest) element in an array of size n in linear time O(n).
One of its steps requires finding the median of the medians:
Split the array into [n/5] parts. Each part has 5 elements.
Find the median in each part. (We have [n/5] numbers now)
Repeat steps 1 and 2 until only one number is left (i.e. recursively).
T(n) = T(n/5) + O(n)
and we can get T(n) = O(n).
But is it true that, if we have a large array, the number we finally get is not the median of medians, but the median of medians of medians of medians of medians of medians?
Please consider an array which has 125 elements.
First, it is split into 25 parts and we find 25 medians.
Then, we split these 25 numbers into 5 parts and find 5 medians.
Finally, we obtain the number which is the median of medians of medians (not the median of medians).
The reason I care about this is that I can understand why there are at most about (3/4)*n elements that are smaller (or larger) than the median of medians. But what if the pivot is not the median of medians but the median of medians of medians? In the worst case there must be fewer elements that are smaller (or larger) than the pivot, which means the pivot is closer to the boundary of the array.
If we have a VERY large array and we find its median of medians of medians of medians of medians of medians, in the worst case the pivot we find can still be very close to the boundary. What is the time complexity in this case?
I made up a dataset of 125 elements. Is the result 9?
0.8 0.9 1 inf inf
1.8 1.9 2 inf inf
6.8 6.9 7 inf inf
inf inf inf inf inf
inf inf inf inf inf
2.8 2.9 3 inf inf
3.8 3.9 4 inf inf
7.8 7.9 8 inf inf
inf inf inf inf inf
inf inf inf inf inf
4.8 4.9 5 inf inf
5.8 5.9 6 inf inf
8.8 8.9 9 inf inf
inf inf inf inf inf
inf inf inf inf inf
inf inf inf inf inf
inf inf inf inf inf
inf inf inf inf inf
inf inf inf inf inf
inf inf inf inf inf
inf inf inf inf inf
inf inf inf inf inf
inf inf inf inf inf
inf inf inf inf inf
inf inf inf inf inf
where inf means the number is large enough.

Let's denote your median of medians of medians of ... as [median of]* = M.
First, note that the median-of-medians pivot selection is not recursive in the way you describe: the grouping into fives is done only once, and the median of the resulting n/5 group medians is then found exactly (in the full algorithm, by a recursive call to the selection routine itself), not by grouping those medians into fives again. The pivot selection goes as follows:
Split the elements into groups of 5
Find the median of each group
Find the median of medians and use it as a pivot.
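A minimal sketch of that selection routine in Python (my own illustrative code, with a 0-based k), just to make the structure concrete:

def select(xs, k):
    # returns the k-th smallest element of xs (0-based) via median of medians
    if len(xs) <= 5:
        return sorted(xs)[k]
    # 1) split into groups of 5 and 2) take each group's median
    medians = [sorted(xs[i:i + 5])[len(xs[i:i + 5]) // 2]
               for i in range(0, len(xs), 5)]
    # 3) the pivot is the true median of those medians, obtained by a
    #    recursive call to select itself -- not by repeated grouping
    pivot = select(medians, len(medians) // 2)
    lo = [x for x in xs if x < pivot]
    hi = [x for x in xs if x > pivot]
    n_eq = len(xs) - len(lo) - len(hi)
    if k < len(lo):
        return select(lo, k)
    if k < len(lo) + n_eq:
        return pivot
    return select(hi, k - len(lo) - n_eq)

# select([7, 1, 3, 9, 5, 2, 8, 6, 4], 4) == 5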
The median of medians will be smaller than 3n/10 elements and larger than another 3n/10 elements, not 3n/4. You have n/5 numbers after selecting the group medians. The median of medians is greater/smaller than half of those numbers, which is n/10. Each of those numbers is a median itself, so it is greater/smaller than 2 further numbers in its own group, giving you another 2n/10 numbers. In total, you get n/10 + 2n/10 = 3n/10 numbers.
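(Concretely, for n = 100: there are 20 group medians, the median of medians beats 10 of them, and each of those 10 beats 2 further elements of its own group of 5, so the pivot beats at least 10 + 20 = 30 = 3n/10 elements.)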
To address your second question: after collecting the groups of 5 in your example dataset and calculating their medians, we will have the following sequence:
1, 2, 7, inf, inf
3, 4, 8, inf, inf
5, 6, 9, inf, inf,
inf, inf, inf, inf, inf,
inf, inf, inf, inf, inf.
Grouping these 25 medians into fives again gives the medians 7, 8, 9, inf, inf, and the median of those is 9. So the number your repeated procedure ends up with would indeed be 9.
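If you want to double-check that mechanically, here is a small sketch (the rows are copied from your example; repeated_median5 is the repeated grouping-by-5 procedure you describe, not the real pivot selection):

inf = float('inf')
rows = [
    [0.8, 0.9, 1, inf, inf], [1.8, 1.9, 2, inf, inf], [6.8, 6.9, 7, inf, inf],
    [inf] * 5, [inf] * 5,
    [2.8, 2.9, 3, inf, inf], [3.8, 3.9, 4, inf, inf], [7.8, 7.9, 8, inf, inf],
    [inf] * 5, [inf] * 5,
    [4.8, 4.9, 5, inf, inf], [5.8, 5.9, 6, inf, inf], [8.8, 8.9, 9, inf, inf],
] + [[inf] * 5] * 12
data = [x for row in rows for x in row]      # the 125 elements

def repeated_median5(xs):
    # keep replacing the list by the medians of its consecutive groups of 5
    while len(xs) > 1:
        xs = [sorted(xs[i:i + 5])[len(xs[i:i + 5]) // 2]
              for i in range(0, len(xs), 5)]
    return xs[0]

pivot = repeated_median5(data)
print(pivot)                                 # 9
print(sum(x <= pivot for x in data))         # 27 -- only 27 of the 125 elements are <= 9

Only 27 of the 125 elements are less than or equal to this pivot, which is exactly the effect you are worried about.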
As for the runtime of computing your proposed [median of]* pivot: the recurrence is the one you wrote, T(n) = T(n/5) + O(n), which still solves to O(n). So the cost of finding this pivot is not the problem; its quality is.
Now let's analyze how many numbers are guaranteed to be less/greater than M.
We have the following groups:
depth 1: n/5 elements all medians
depth 2: n/25 elements all medians
...
depth i: n/(5^i) elements all medians
Each depth-i median is greater than or equal to 3 of the 5 depth-(i-1) elements in its group (itself and 2 smaller ones), each of which is in turn greater than or equal to 3 elements at depth i-2, and so on down to the original array. Unrolling this, a single depth-k median accounts for 3^k original elements that are less than or equal to it (and symmetrically for greater than or equal).
At depth k there are n/5^k medians left and M is their median, so it beats at least half of them, about n/(2*5^k); each of those accounts for 3^k original elements, so in total M is guaranteed to be greater/less than about (n/(2*5^k)) * 3^k = (n/2) * (3/5)^k original elements. For depth k = 1 this is the usual 3n/10 bound of the median of medians.
Now assume you recurse all the way down, i.e. n = 5^k and k = log_5(n). Then only one number is left at the top, and the guarantee shrinks to 3^k = n^(log_5 3) ≈ n^0.68 elements, a vanishing fraction of n. (In your 125-element example, 3^3 = 27, and indeed exactly 27 of the 125 elements are <= the pivot 9.)
That is why the full-depth [median of]* is a poor pivot, while stopping at depth 1 keeps the 3n/10 guarantee that makes the selection algorithm run in linear time.

Related

Sorting data from a large text file and converting it into an array

I have a text file that contains some data.
#this is a sample file
# data can be used for practice
total number = 5
t=1
dx= 10 10
dy= 10 10
dz= 10 10
1 0.1 0.2 0.3
2 0.3 0.4 0.1
3 0.5 0.6 0.9
4 0.9 0.7 0.6
5 0.4 0.2 0.1
t=2
dx= 10 10
dy= 10 10
dz= 10 10
1 0.11 0.25 0.32
2 0.31 0.44 0.12
3 0.51 0.63 0.92
4 0.92 0.72 0.63
5 0.43 0.21 0.14
t=3
dx= 10 10
dy= 10 10
dz= 10 10
1 0.21 0.15 0.32
2 0.41 0.34 0.12
3 0.21 0.43 0.92
4 0.12 0.62 0.63
5 0.33 0.51 0.14
My aim is to read the file, find the rows whose first column is 1 or 5, and store them as multidimensional arrays: for 1 it will be a1=[[0.1, 0.2, 0.3],[0.11, 0.25, 0.32],[0.21, 0.15, 0.32]] and for 5 it will be a5=[[0.4, 0.2, 0.1],[0.43, 0.21, 0.14],[0.33, 0.51, 0.14]].
Here is the code I have written:
import numpy as np

with open("position.txt", "r") as data:
    lines = data.read().split(sep='\n')

a1 = []
a5 = []
for line in lines:
    if line.startswith('1'):
        a1.append(list(map(float, line.split()[1:])))
    elif line.startswith('5'):
        a5.append(list(map(float, line.split()[1:])))
a1 = np.array(a1)
a5 = np.array(a5)
My code works perfectly with the sample file I have uploaded, but in the real case my file is much larger (2 GB), and handling it with my code raises a memory error. How can I solve this issue? I have 96 GB of RAM in my workstation.
There are several things to improve:
Don't attempt to load the entire text file in memory (that will save 2 GB).
Use numpy arrays, not lists, for storing numerical data.
Use single-precision floats rather than double-precision.
So, you need to estimate how big your array will be. It looks like there may be 16 million records for 2 GB of input data. With 32-bit floats, you need 16e6*2*4=128 MB of memory. For a 500 GB input, it will fit in 33 GB memory (assuming you have the same 120-byte record size).
import numpy as np

nmax = int(20e+6)  # take a bit of safety margin
a1 = np.zeros((nmax, 3), dtype=np.float32)
a5 = np.zeros((nmax, 3), dtype=np.float32)
n1 = n5 = 0

with open("position.txt", "r") as data:
    for line in data:
        if '0' <= line[0] <= '9':
            values = np.fromstring(line, dtype=np.float32, sep=' ')
            if values[0] == 1:
                a1[n1] = values[1:]
                n1 += 1
            elif values[0] == 5:
                a5[n5] = values[1:]
                n5 += 1

# trim (no memory is released)
a1 = a1[:n1]
a5 = a5[:n5]
Note that float equality comparisons (==) are generally not recommended, but in the case of values[0] == 1 we know that it's a small integer, for which the float representation is exact.
If you want to economize on memory (for example if you want to run several python processes in parallel), then you could initialize the arrays as disk-mapped arrays, like this:
a1 = np.memmap('data_1.bin', dtype=np.float32, mode='w+', shape=(nmax, 3))
a5 = np.memmap('data_5.bin', dtype=np.float32, mode='w+', shape=(nmax, 3))
With memmap, the files won't contain any metadata on data type and array shape (or human-readable descriptions). I'd recommend that you convert the data to npz format in a separate job; don't run these jobs in parallel because they will load the entire array in memory.
n = 3  # number of rows that were actually written to the memmap (3 for the sample file)
a1m = np.memmap('data_1.bin', dtype=np.float32, shape=(n, 3))
a5m = np.memmap('data_5.bin', dtype=np.float32, shape=(n, 3))
np.savez('data.npz', a1=a1m, a5=a5m, info='This is test data from SO')
You can load them like this:
data = np.load('data.npz')
a1 = data['a1']
Depending on the balance between cost of disk space, processing time, and memory, you could compress the data.
import zlib
zlib.Z_DEFAULT_COMPRESSION = 3 # faster for lower values
np.savez_compressed('data.npz', a1=a1m, a5=a5m, info='...')
If float32 has more precision than you need, you could truncate the binary representation for better compression.
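One way to do that (just a sketch; the function name is mine, and it assumes a float32 array): zero out the low mantissa bits before saving, which leaves the values almost unchanged but makes the byte stream much more repetitive, so it compresses better.

import numpy as np

def truncate_mantissa(a, bits=13):
    # float32 has 23 mantissa bits; dropping the low 13 keeps roughly 3 significant digits
    mask = ~np.uint32((1 << bits) - 1)
    return (a.view(np.uint32) & mask).view(np.float32)

# np.savez_compressed('data.npz', a1=truncate_mantissa(a1), a5=truncate_mantissa(a5))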
If you like memory-mapped files, you can save in npy format:
np.save('data_1.npy', a1m)
a1 = np.load('data_1.npy', mmap_mode='r+')
But then you can't use compression, and you'll end up with many files that contain no metadata other than the array size and datatype.

Why do I get different precision, recall and F1 scores for different methods of calculating the macro average

I calculated the macro-average of the P, R and F1 of my classification using two methods. Method 1 is
print("Macro-Average Precision:", metrics.precision_score(predictions, test_y, average='macro'))
print("Macro-Average Recall:", metrics.recall_score(predictions, test_y, average='macro'))
print("Macro-Average F1:", metrics.f1_score(predictions, test_y, average='macro'))
gave this result:
Macro-Average Precision: 0.6822
Macro-Average Recall: 0.7750
Macro-Average F1: 0.7094
Method 2 is:
print(classification_report(y_true, y_pred))
gave this result:
              precision    recall  f1-score   support

           0       0.55      0.25      0.34       356
           1       0.92      0.96      0.94      4793
           2       0.85      0.83      0.84      1047

    accuracy                           0.90      6196
   macro avg       0.78      0.68      0.71      6196
weighted avg       0.89      0.90      0.89      6196
I expected the output of both methods to be the same, since they were generated at the same time in the same run.
Can someone explain why this happened, or whether there is a mistake somewhere?
As far as I can tell from the classification_report results, you have multiple classes.
If you check the documentation for the individual functions in the metrics module, the default parameters consider the class '1' as the positive class.
I think what might be happening is that your first calculation is a one-versus-all computation (0 and 2 are the negative classes and 1 is the positive class), while in the second case you are taking a true multi-class situation into account.
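For reference, it also helps to call both methods on the same pair of arrays and in sklearn's documented argument order, which is ground truth first: f(y_true, y_pred, ...). A small self-contained sketch (the labels below are made up, not your data); note that reversing the two arguments swaps each class's precision and recall while leaving F1 unchanged, so it is worth checking which order you used in each method:

from sklearn import metrics

y_true = [0, 1, 1, 2, 2, 1, 0, 2]   # made-up gold labels with 3 classes
y_pred = [0, 1, 2, 2, 1, 1, 1, 2]   # made-up predictions

# ground truth first, predictions second
print(metrics.precision_score(y_true, y_pred, average='macro'))
print(metrics.recall_score(y_true, y_pred, average='macro'))
print(metrics.f1_score(y_true, y_pred, average='macro'))
print(metrics.classification_report(y_true, y_pred))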

Multiple linear regression with missing covariates

Imagine I have a dataset like
df <- data.frame(y=c(11:16), x1=c(23,NA,27,20,20,21), x2=c(NA,9,2,9,7,8))
df
y x1 x2
1 11 23 NA
2 12 NA 9
3 13 27 2
4 14 20 9
5 15 20 7
6 16 21 8
If I perform a multiple linear regression, I get
m <- lm(y~x1+x2, data=df)
summary(m)
Call:
lm(formula = y ~ x1 + x2, data = df)
Residuals:
3 4 5 6
-1.744e-01 -1.047e+00 -4.233e-16 1.221e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.72093 27.06244 0.729 0.599
x1 -0.24419 0.93927 -0.260 0.838
x2 0.02326 1.01703 0.023 0.985
Residual standard error: 1.617 on 1 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.4767, Adjusted R-squared: -0.5698
F-statistic: 0.4556 on 2 and 1 DF, p-value: 0.7234
Here we have 2 observations (1 and 2) deleted due to missingness.
To reduce the effects of missing data, would it be wise to compute 2 different simple linear regressions?
I.e.
m1 <- lm(y~x1, data=df)
m2 <- lm(y~x2, data=df)
In this case, for each model we will have only 1 observation deleted due to missingness.
No, that would probably not be wise.
Because you run into the issue of omitted variables bias.
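(To make the bias concrete: if the true model is y = b0 + b1*x1 + b2*x2 + e but you regress y on x1 alone, the expected slope on x1 is not b1 but b1 + b2*d, where d is the slope from regressing x2 on x1. The second term is the bias you pick up by omitting x2, and it vanishes only when b2 = 0 or when x1 and x2 are uncorrelated.)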
You can see how this affects your estimates, for instance for x1, whose coefficient is inflated:
summary(lm(y~x1, data=df))
Call:
lm(formula = y ~ x1, data = df)
Residuals:
1 3 4 5 6
-2.5287 0.8276 -0.5460 0.4540 1.7931
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.3276 7.1901 2.966 0.0592 .
x1 -0.3391 0.3216 -1.054 0.3692
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.897 on 3 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.2703, Adjusted R-squared: 0.02713
F-statistic: 1.112 on 1 and 3 DF, p-value: 0.3692
Note that your relation of interest is y ~ x1 + x2, i.e. the effect of x1 on y accounting for the effect of x2, and vice versa.
That is of course not the same as estimating y ~ x1 and y ~ x2 separately, where you omit the effect of the other explanatory variable.
Now there are of course strategies to deal with missing values.
One option is estimating a Bayesian model, using JAGS for instance, where you can model the missing values. An example would be the following, where I use the mean and standard deviation of each variable to model the missing values:
model{
  for(i in 1:N){
    y[i] ~ dnorm(yhat[i], tau)
    yhat[i] <- a + b1*x1[i] + b2*x2[i]

    # Accounting for missing data
    # (note: dnorm() in JAGS takes a mean and a precision, not a standard deviation)
    x1[i] ~ dnorm(22, 3)
    x2[i] ~ dnorm(7, 1.3)
  }

  # Priors
  a ~ dnorm(0, .01)
  b1 ~ dnorm(0, .01)
  b2 ~ dnorm(0, .01)

  # Hyperpriors
  tau <- pow(sd, -2)
  sd ~ dunif(0, 20)
}
This is just off the top of my head.
For better and more insightful advice on how to deal with missing values, I would recommend paying a visit to stats.stackexchange.

How to build a scatter graph in excel with average y value for each x value

I am not sure whether this is the best place to ask,
but I have summarized my program's performance data in an Excel file and I want to build a scatter graph.
For each x value I have 6 y values, and I want my graph to contain the average of those 6 for each x.
Is there a way to do this in Excel?
For example: I have
X Y
1 0.2
1 0
1 0
1 0.8
1 1.4
1 0
2 0.2
2 1.2
2 1
2 2.2
2 0
2 2.2
3 0.8
3 1.6
3 0
3 3.6
3 1.2
3 0.6
For each x I want my graph to contain the average y.
Thanks
Not certain what you want, but I suggest inserting a column (assumed to be B) immediately between your two existing ones and populating it with:
=AVERAGEIF(A:A,A2,C:C)
then plotting X against those values.
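(With your sample data, for example, each of the six rows with X = 1 gets (0.2+0+0+0.8+1.4+0)/6 = 0.4 in column B, so the plotted point for X = 1 is (1, 0.4).)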
Or maybe better, just subtotal for each change in X with average for Y and plot that.

Effective reasonable indexing for numeric vector search?

I have a long numeric table where 7 columns form a key and 4 columns are the value to find.
Actually, I have rendered an object at different distances and perspective angles and have calculated Hu moments for its contour. But this is not important to the question, just a sample to help imagine it.
So, when I have 7 values, I need to scan the table, find the closest values in those 7 columns and extract the corresponding 4 values.
The aspects of the task to consider are as follows:
1) the numbers have errors;
2) the scale in the function's domain is not the same as the scale of its values, i.e. the "distance" between points in the 7-dimensional key space should take into account how much each coordinate affects the 4 values;
3) the search should be fast.
So the question is: is there some algorithm out there that solves this task efficiently, i.e. that builds some index on those 7 columns, not the way conventional databases do it, but taking the points above into account?
If I understand the problem correctly, you might consider using scipy.cluster.vq (vector quantization):
Suppose your 7 numeric columns look like this (let's call the array code_book):
import scipy.cluster.vq as vq
import scipy.spatial as spatial
import numpy as np
np.random.seed(2013)
np.set_printoptions(precision=2)
code_book = np.random.random((3,7))
print(code_book)
# [[ 0.68 0.96 0.27 0.6 0.63 0.24 0.7 ]
# [ 0.84 0.6 0.59 0.87 0.7 0.08 0.33]
# [ 0.08 0.17 0.67 0.43 0.52 0.79 0.11]]
Suppose the associated 4 columns of values look like this:
values = np.arange(12).reshape(3,4)
print(values)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
And finally, suppose we have some "observations" of 7-column values like this:
observations = np.random.random((5,7))
print(observations)
# [[ 0.49 0.39 0.41 0.49 0.9 0.89 0.1 ]
# [ 0.27 0.96 0.16 0.17 0.72 0.43 0.64]
# [ 0.93 0.54 0.99 0.62 0.63 0.81 0.36]
# [ 0.17 0.45 0.84 0.02 0.95 0.51 0.26]
# [ 0.51 0.8 0.2 0.9 0.41 0.34 0.36]]
To find the 7-valued row in code_book which is closest to each observation, you could use vq.vq:
index, dist = vq.vq(observations, code_book)
print(index)
# [2 0 1 2 0]
The index values refer to rows in code_book. However, if the rows in values are ordered the same way as code_book, we can look up the associated values with values[index]:
print(values[index])
# [[ 8 9 10 11]
# [ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [ 0 1 2 3]]
The above assumes you have all your observations arranged in an array. Thus, to find all the indices you need only one call to vq.vq.
However, if you obtain the observations one at a time and need to find the closest row in code_book before going on to the next observation, then it would be inefficient to call vq.vq each time. Instead, generate a KDTree once, and then find the nearest neighbor(s) in the tree:
tree = spatial.KDTree(code_book)
for observation in observations:
    distances, indices = tree.query(observation)
    print(indices)
# 2
# 0
# 1
# 2
# 0
Note that the number of points in your code_book (N) must be large compared to the dimension of the data (e.g. N >> 2**7) for the KDTree to be fast compared to simple exhaustive search.
Using vq.vq or KDTree.query may or may not be faster than exhaustive search, depending on the size of your data (code_book and observations). To find out which is faster, be sure to benchmark these versus an exhaustive search using timeit.
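A rough benchmarking sketch along those lines, reusing code_book, observations and tree from above (the brute-force lambda is just an illustrative exhaustive scan):

import timeit

brute_force = lambda: [((code_book - o) ** 2).sum(axis=1).argmin() for o in observations]
kdtree_query = lambda: [tree.query(o)[1] for o in observations]

print(timeit.timeit(brute_force, number=1000))
print(timeit.timeit(kdtree_query, number=1000))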
I don't know if I understood your question well, but I will try to give an answer.
For each row K in the table, compute the distance of your key from the key in that row:
( (X1-K1)^2 + (X2-K2)^2 + (X3-K3)^2 + (X4-K4)^2 + (X5-K5)^2 + (X6-K6)^2 + (X7-K7)^2 )^0.5
where {X1,X2,X3,X4,X5,X6,X7} is the key and {K1,K2,K3,K4,K5,K6,K7} is the key at row K.
You could make one factor of the key more or less relevant than the others by multiplying its term while computing the distance; for example, you could replace (X1-K1)^2 in the formula above with 5*(X1-K1)^2 to make that factor more influential.
Store the distance in one variable and the row number in a second variable.
Do the same with the following rows, and if the new distance is lower than the one you stored, replace the stored distance and row number.
When you have checked all the rows in your table, the second variable will hold the row nearest to the key.
Here is the idea as Python-style pseudo-code:

def distance(key, row_key):
    # Euclidean distance from the formula above; multiply individual terms
    # by a weight here if some columns should count more than others
    return sum((x - k) ** 2 for x, k in zip(key, row_key)) ** 0.5

key = [0.0] * 7                    # suppose it is already filled with some values
closest_distance = float('inf')
closest_row = 0
for row in range(number_of_rows):  # table has number_of_rows rows and 7 + 4 columns
    new_distance = distance(key, table[row][0:7])
    if new_distance < closest_distance:
        closest_distance = new_distance
        closest_row = row
value_found = table[closest_row][7:11]   # this should be the value you were looking for
I know it isn't fast, but it is the best I could do; I hope it helped.
P.S. I know I haven't considered measurement errors.

Resources