How to get skewness and kurtosis using Julia - statistics

I'm working on point cloud lidar data. I want to calculate skewness and kurtosis to distinguish road from ground. My current intensity column looks like this in a plot. Is there an implementation in another language? I read:
First, the skewness of the point cloud is calculated. If it is greater than zero, peaks dominate the point cloud distribution as shown in Table 1. Thus, the highest value of the point cloud is removed by classifying it as an object point. To separate all ground and object points, these steps are iteratively executed while the skewness of the point cloud is greater than zero.
UPDATE
I tried the following code, but it's an infinite loop.
I'm getting StatsBase.skewness(poi.intensity) around 0.125.
using Distributions, StatsBase, TypedTables   # packages assumed from the calls below (Gamma, skewness, Table)

data = rand(Gamma(7.5, 1.0), 75)
th = maximum(data)
classification = rand(2:3, length(data))
poi = Table(intensity = data, classification = classification)

sk = skewness(poi.intensity)   # ≈ 0.125, as mentioned above
while sk > 0
    poi.classification[findall(poi.intensity .== th)] .= 7   # mark as object
    th = th - 1   # note: th - 1 will generally not match any intensity exactly
    sk = skewness(poi.intensity[findall(poi.classification .== 2)])
    print(sk)
end
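Since the question also asks whether there is an implementation in another language, here is a minimal Python/NumPy sketch of the skewness-balancing loop described in the quoted passage (the function name and the synthetic gamma data are illustrative assumptions, not a tuned road/ground classifier):

import numpy as np
from scipy.stats import skew

def skewness_balance(intensity):
    """Iteratively flag the highest remaining point as 'object' while skewness > 0.
    Returns a boolean mask: True = ground, False = object."""
    ground = np.ones(len(intensity), dtype=bool)
    while ground.sum() > 2 and skew(intensity[ground]) > 0:
        idx = np.argmax(np.where(ground, intensity, -np.inf))  # current maximum among remaining points
        ground[idx] = False
    return ground

rng = np.random.default_rng(0)
intensity = rng.gamma(7.5, 1.0, 75)          # synthetic data, like the Julia example above
mask = skewness_balance(intensity)
print(mask.sum(), "ground points,", (~mask).sum(), "object points")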

Related

Weighted mean, sd and median - Size Weighted (Negative Numbers)

I need to calculate the weighted median, average and sd of PE funds' returns. I weighted the sample according to the amount of committed capital of each fund, but I should consider negative products to analyze underperforming funds. However, I'm not sure whether I can use negative/zero values to derive these statistical measures.
Wμ = Σ(wᵢ·xᵢ) / Σwᵢ --> the formula I consider for the weighted average
w = fund's size
x = net IRR
wᵢ·xᵢ = the products, which can be negative or positive.
How can I calculate those measures, including negative/zero values? I'm doing it in Excel.
My starting point is Kaplan and Schoar's approach (Private Equity Performance: Returns, Persistence, and Capital Flows).
Any help on this matter is really appreciated!
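Negative or zero returns do not break any of these formulas as long as the weights (the fund sizes) stay positive. A quick Python sketch of the three weighted measures (the IRR and size values are made up, and the cumulative-weight rule used for the weighted median is one common convention):

import numpy as np

def weighted_stats(x, w):
    """Weighted mean, standard deviation and median; x may contain negative/zero values."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    mean = np.sum(w * x) / np.sum(w)                 # Wμ = Σ(wᵢ·xᵢ) / Σwᵢ
    var = np.sum(w * (x - mean) ** 2) / np.sum(w)    # (biased) weighted variance
    order = np.argsort(x)                            # weighted median: value where the
    cum_w = np.cumsum(w[order])                      # cumulative weight reaches half the total
    median = x[order][np.searchsorted(cum_w, 0.5 * cum_w[-1])]
    return mean, np.sqrt(var), median

irr = [-0.05, 0.12, 0.0, 0.30, -0.15]   # hypothetical net IRRs
size = [50, 200, 120, 80, 60]           # hypothetical committed capital
print(weighted_stats(irr, size))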

understanding of result of logistic regression

Let us suppose we have the following data with a binary response output (coupon).
Annual spending is given in units of $1000. My goal is to estimate whether a customer who spends more than 2000 and has a Simmons card will also use a coupon. First of all, I sorted the data according to the response variable and got the following picture.
At the next stage I calculated the logit for each observation. Initially I chose the following coefficients:
B0 0.1
B1 0.1
B2 0.1
and I calculated L according to the following formula:
L = B0 + B1 * Annual spending + B2 * Simmons card
At the next stage I calculated e^L (which in Excel can be done easily with the EXP function):
=EXP(D2)
After that I calculated the probability:
=E2/(1+E2)
and finally, using the log-likelihood formula y*LN(P) + (1-y)*LN(1-P),
I calculated the log-likelihood for each row.
Then I calculated the sum, and using Solver I found the coefficients that minimize this sum (please note that the values are negative), but I got all coefficients equal to zero.
Am I wrong? Or does it mean that I can't predict the buying of a coupon on the basis of Annual spending and owning a Simmons card? Thanks in advance.
You can predict the buying of a coupon on the basis of Annual spending (and knowing about the Simmons card doesn't help).
Admittedly I didn't solve it in Excel, but I suspect the problem might be that your optimization didn't converge (i.e., failed to reach the correct coefficients through the solving process) -- the correct coefficients are B0 = 5.63, B1 = -2.95, and B2 = 0. I found an online reference for the Excel logistic regression procedure at http://blog.excelmasterseries.com/2014/06/logistic-regression-performed-in-excel.html.
I ran the logistic regression myself and found that Annual spending is significant (at the 0.05 level) whereas Simmons card is not. Re-running the model with Simmons card removed yields the following equations:
L = 5.63 - 2.95 * Annual spending
P(1) = exp(L)/(1 + exp(L))
If P(1) > 0.5 => coupon = 1
Although the entropy Rsquare is low at 0.39 (and the number of data points is very low), the model is statistically significant.
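For reference, here is a short Python sketch that scores new customers with the equation given in this answer (the example spending values are made up):

import numpy as np

b0, b1 = 5.63, -2.95                                   # coefficients reported above (Simmons card dropped)

def predict_coupon(annual_spending):
    """Coupon-use probability from annual spending in $1000s; sketch only."""
    L = b0 + b1 * np.asarray(annual_spending, float)   # linear predictor (logit)
    p = np.exp(L) / (1 + np.exp(L))                    # logistic transform
    return p, (p > 0.5).astype(int)                    # probability and 0/1 prediction

print(predict_coupon([1.5, 2.0, 2.5]))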

Expand a set of numbers in excel [closed]

I have an excel spreadsheet with
x y
0 -1.5
100 1.6
200 0
300 -6.8
400 -19.8
500 -39.9
I want to find the values where x = 600 through 1500. I have tried making a graph and using the trend line and getting Polynomial 2, and it returns
y = -2.8857x² + 12.686x - 11.7
R² = 0.999
So I plug this into my calculation using
=-2.8857*A110*A110+12686*A110-11.7
where A110 is the value 600, but it returns
6572736.3
I'm no math major, but in a trend of -6.8, -19.8, -39.9, the next number is not 6572736.3.
Can someone please tell me how to figure out the equation so I can complete the series of numbers?
I concur with @mkingston (see output below**).
I'd add two points:
1) I find it is always a good idea to plot the original data and the regression equation before doing anything with the equation. In this case, plotting @mkingston's result shows that the fitted results (the lines) are, in fact, a good fit to the original data.
2) Extrapolation is always hazardous. If you already have a very good reason to believe that the underlying function is a quadratic of the form we've fitted here, then the fit results below indicate the uncertainty in the parameters and hence can be used to estimate the uncertainty in the prediction (which may be quite substantial once you extrapolate to x = 1500). If, on the other hand, the quadratic is just a convenient shape that fits the data range available to us, then there are many alternative functions that could fit the available data roughly as well but would predict wildly different values over the range x = 600 to 1500. In that case, I'd describe any prediction at x = 600 as very uncertain and any prediction beyond that point as highly speculative, at best.
**The output I get from the Data | Data Analysis | Regression function of Excel 2007 is (after I've edited to change "X Variable" to "X" and "X Variable 2" to "X^2" for clarity):

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.999516468
R Square             0.99903317
Adjusted R Square    0.998388617
Standard Error       0.647338875
Observations         6

ANOVA
             df    SS            MS            F           Significance F
Regression    2    1299.01619    649.5080952   1549.9625   3.00625E-05
Residual      3    1.257142857   0.419047619
Total         5    1300.273333

             Coefficients    Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept    -1.9            0.586700679      -3.238448611   0.047907326   -3.767143409   -0.032856591
X             0.069142857    0.005518676      12.52888554    0.00109613     0.051579968    0.086705746
X^2          -0.000288571    1.05946E-05      -27.23767444   0.000108607   -0.000322288   -0.000254855
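For anyone who wants to check the fit outside Excel, a quick NumPy sketch on the data from the question (the same least-squares quadratic; the extrapolated values carry all the caveats from point 2 above):

import numpy as np

x = np.array([0, 100, 200, 300, 400, 500], dtype=float)
y = np.array([-1.5, 1.6, 0, -6.8, -19.8, -39.9])

coeffs = np.polyfit(x, y, 2)          # should be roughly [-0.000289, 0.0691, -1.9]
print(coeffs)

x_new = np.arange(600, 1501, 100)     # extrapolation range asked about in the question
print(np.polyval(coeffs, x_new))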

Statistics help for computer vision [closed]

I am doing my graduation project in the field of computer vision, and I have only taken one course in statistics that covered very basic concepts. Now I am facing difficulty with more advanced topics, so I need help (a book, tutorial, course, etc.) to grasp and review the basic ideas and concepts in statistics and then dive into the statistical details used in computer vision.
You can calculate False Positives, False Negatives, etc. with this Confusion Matrix PyTorch example:
import torch

def confusion(prediction, truth):
    """Returns the confusion matrix for the values in the `prediction` and `truth`
    tensors, i.e. the amount of positions where the values of `prediction`
    and `truth` are
    - 1 and 1 (True Positive)
    - 1 and 0 (False Positive)
    - 0 and 0 (True Negative)
    - 0 and 1 (False Negative)
    """
    confusion_vector = prediction / truth
    # Element-wise division of the 2 tensors returns a new tensor which holds a
    # unique value for each case:
    #   1   where prediction and truth are 1 (True Positive)
    #   inf where prediction is 1 and truth is 0 (False Positive)
    #   nan where prediction and truth are 0 (True Negative)
    #   0   where prediction is 0 and truth is 1 (False Negative)
    true_positives = torch.sum(confusion_vector == 1).item()
    false_positives = torch.sum(confusion_vector == float('inf')).item()
    true_negatives = torch.sum(torch.isnan(confusion_vector)).item()
    false_negatives = torch.sum(confusion_vector == 0).item()

    return true_positives, false_positives, true_negatives, false_negatives
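A quick usage check with made-up hard 0/1 predictions (the function above expects float tensors so that the element-wise division produces inf and nan where needed):

pred  = torch.tensor([1., 1., 0., 0., 1., 0.])
truth = torch.tensor([1., 0., 0., 1., 1., 0.])
print(confusion(pred, truth))   # expected: (2, 1, 2, 1)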
You could use nn.BCEWithLogitsLoss (and therefore remove the sigmoid) and set pos_weight > 1 to increase the recall. Or you can optimize further by using the Dice coefficient to penalize the model for false positives, with something like:
import numpy as np

def Dice(y_true, y_pred):
    """Returns the Dice Similarity Coefficient for ground truth and predicted masks."""
    # masks are assumed to be 0/255 images; scale to 0/1 and binarize
    y_true = np.squeeze(y_true) / 255
    y_pred = np.squeeze(y_pred) / 255
    y_true = y_true.astype(bool)   # astype returns a new array, so assign it back
    y_pred = y_pred.astype(bool)
    intersection = np.logical_and(y_true, y_pred).sum()
    # the +1 terms smooth the ratio and avoid division by zero for empty masks
    return (2. * intersection + 1.) / (y_true.sum() + y_pred.sum() + 1.)
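A small check with two made-up 0/255 masks:

mask_a = np.zeros((4, 4), dtype=np.uint8); mask_a[:2, :] = 255   # top two rows
mask_b = np.zeros((4, 4), dtype=np.uint8); mask_b[1:3, :] = 255  # middle two rows
print(Dice(mask_a, mask_b))   # 4 px overlap: (2*4 + 1) / (8 + 8 + 1) ≈ 0.53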
IOU Calculations Explained
Count true positives (TP)
Count false positives (FP)
Count false negatives (FN)
Intersection = TP
Union = TP + FP + FN
IOU = Intersection/Union
The left side is our ground truth, while the right side contains our predictions. The highlighted cells on the left side note which class we are looking at for statistics on the right side. The highlights on the right side note true positives in a cream color, false positives in orange, and false negatives in yellow (note that all others are true negatives: they are not predicted as this individual class, and should not be, based on the ground truth).
For Class 0, only the top row of the 4x4 matrix should be predicted as zeros. This is a rather simplified version of a real ground truth. In reality, the zeros could be anywhere in the matrix. On the right side, we see 1,0,0,0, meaning the first is a false negative, but the other three are true positives (aka 3 for Intersection as well). From there, we need to find anywhere else where zero was falsely predicted, and we note that happens once on the second row, and twice on the fourth row, for a total of three false positives.
To get the union, we add up TP (3), FP (3) and FN (1) to get seven. The IOU for this class, therefore, is 3/7.
If we do this for all the classes and average the IOUs, we get:
Mean IOU = [(3/7) + (2/6) + (3/4) + (1/6)] / 4 = 0.420
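The same bookkeeping in a few lines of NumPy (the 4x4 label matrices below are random placeholders, not the ones from the worked example above):

import numpy as np

def per_class_iou(truth, pred, n_classes):
    """IoU per class from two integer label arrays; assumes labels are 0..n_classes-1."""
    ious = []
    for c in range(n_classes):
        tp = np.sum((pred == c) & (truth == c))
        fp = np.sum((pred == c) & (truth != c))
        fn = np.sum((pred != c) & (truth == c))
        union = tp + fp + fn                       # Union = TP + FP + FN
        ious.append(tp / union if union else float('nan'))
    return ious

truth = np.random.randint(0, 4, (4, 4))
pred = np.random.randint(0, 4, (4, 4))
ious = per_class_iou(truth, pred, 4)
print(ious, "mean IoU:", np.nanmean(ious))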
You will also want to learn how to pull the statistics for mAP (Mean Average Precision):
https://www.youtube.com/watch?v=pM6DJ0ZZee0
https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52#1a59
https://medium.com/@hfdtsinghua/calculate-mean-average-precision-map-for-multi-label-classification-b082679d31be
Compute Covariance Matrices
The variance of a variable describes how much the values are spread. The covariance is a measure that tells the amount of dependency between two variables.
A positive covariance means that the values of the first variable are large when values of the second variables are also large. A negative covariance means the opposite: large values from one variable are associated with small values of the other.
The covariance value depends on the scale of the variables, so it is hard to interpret on its own. It is possible to use the correlation coefficient instead, which is easier to interpret: the correlation coefficient is just the normalized covariance (the covariance divided by the product of the two standard deviations).
The covariance matrix is a matrix that summarises the variances and covariances of a set of vectors and it can tell a lot of things about your variables. The diagonal corresponds to the variance of each vector:
Consider a matrix A and its covariance matrix: the diagonal of the covariance matrix corresponds to the variance of each column vector. Let's check with the formula of the variance:

var(X) = (1/n) * Σᵢ (xᵢ - x̄)²

with n the length of the vector and x̄ the mean of the vector. For instance, taking the matrix A used in the code example below, the variance of its first column vector [1, 5, 3] is:

((1 - 3)² + (5 - 3)² + (3 - 3)²) / 3 = 8/3 ≈ 2.67

This is the first cell of our covariance matrix. The second element on the diagonal corresponds to the variance of the second column vector of A, and so on.
Note: the vectors extracted from the matrix A correspond to the columns of A.
The other cells correspond to the covariance between two column vectors of A. For instance, the covariance between the first and the third column is located in the covariance matrix at column 1, row 3 (and, because the matrix is symmetric, also at column 3, row 1, with the same value).
Let's check that the covariance between the first and the third column vector of A is equal to -2.67. The formula of the covariance between two variables X and Y is:

cov(X, Y) = (1/n) * Σᵢ (xᵢ - x̄)(yᵢ - ȳ)
The variables X and Y are the first and the third column vectors in this example. Let's split this formula to be sure that it is crystal clear:
The sum symbol (Σ) means that we iterate over the elements of the vectors. We start with the first element (i = 1) and take the first element of X minus the mean of the vector X: (x₁ - x̄).
Multiply the result by the first element of Y minus the mean of the vector Y: (x₁ - x̄)(y₁ - ȳ).
Reiterate the process for each element of the vectors and calculate the sum of all results: Σᵢ (xᵢ - x̄)(yᵢ - ȳ).
Divide by the number of elements in the vector.
EXAMPLE - Let's calculate the covariance between the first and the third column vectors of A:

X = [1, 5, 3] and Y = [5, 1, 6]

with x̄ = 3, ȳ = 4 and n = 3, so we have:

cov(X, Y) = ((1 - 3)(5 - 4) + (5 - 3)(1 - 4) + (3 - 3)(6 - 4)) / 3 = (-2 - 6 + 0) / 3 = -8/3 ≈ -2.67
Code example -
Using NumPy, the covariance matrix can be calculated with the function np.cov.
It is worth noting that if you want NumPy to use the columns as vectors, the parameter rowvar=False has to be used. Also, bias=True divides by n and not by n-1.
Let’s create the array first:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
A = np.array([[1, 3, 5], [5, 4, 1], [3, 8, 6]])
Now we will calculate the covariance with the NumPy function:
np.cov(A, rowvar=False, bias=True)
Finding the covariance matrix with the dot product
There is another way to compute the covariance matrix of A. You can center A around 0. The mean of the vector is subtracted from each element of the vector to have a vector with mean equal to 0. It is multiplied with its own transpose, and divided by the number of observations.
Let’s start with an implementation and then we’ll try to understand the link with the previous equation:
def calculateCovariance(X):
    meanX = np.mean(X, axis=0)
    lenX = X.shape[0]
    X = X - meanX
    covariance = X.T.dot(X) / lenX
    return covariance

print(calculateCovariance(A))
Output:
array([[ 2.66666667, 0.66666667, -2.66666667],
[ 0.66666667, 4.66666667, 2.33333333],
[-2.66666667, 2.33333333, 4.66666667]])
The dot product between two vectors x and y can be expressed as:

x · y = Σᵢ xᵢyᵢ

that is, the sum of the products of each pair of elements. If we center the matrix A and take the dot product of its transpose with itself, each cell of the resulting matrix is the dot product of two centered columns, which is exactly the sum in the covariance formula above; dividing by the number of observations then gives the covariance matrix.
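A one-line sanity check that this dot-product construction agrees with np.cov (using the matrix A and the calculateCovariance function defined above):

print(np.allclose(calculateCovariance(A), np.cov(A, rowvar=False, bias=True)))   # True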
Visualize data and covariance matrices
In order to get more insights about the covariance matrix and how it can be useful, we will create a function to visualize it along with 2D data. You will be able to see the link between the covariance matrix and the data.
This function will calculate the covariance matrix as we have seen above. It will create two subplots — one for the covariance matrix and one for the data. The heatmap() function from Seaborn is used to create gradients of colour — small values will be coloured in light green and large values in dark blue. We chose one of our palette colours, but you may prefer other colours. The data is represented as a scatterplot.
def plotDataAndCov(data):
    ACov = np.cov(data, rowvar=False, bias=True)
    print('Covariance matrix:\n', ACov)

    fig, ax = plt.subplots(nrows=1, ncols=2)
    fig.set_size_inches(10, 10)

    ax0 = plt.subplot(2, 2, 1)
    # Choosing the colors
    cmap = sns.color_palette("GnBu", 10)
    sns.heatmap(ACov, cmap=cmap, vmin=0)

    ax1 = plt.subplot(2, 2, 2)
    # data can include the colors as a third column
    if data.shape[1] == 3:
        c = data[:, 2]
    else:
        c = "#0A98BE"
    ax1.scatter(data[:, 0], data[:, 1], c=c, s=40)

    # Remove the top and right axes from the data plot
    ax1.spines['right'].set_visible(False)
    ax1.spines['top'].set_visible(False)
Uncorrelated data
Now that we have the plot function, we will generate some random data to visualize what the covariance matrix can tell us. We will start with some data drawn from a normal distribution with the NumPy function np.random.normal().
This function needs the mean, the standard deviation and the number of observations of the distribution as input. We will create two random variables of 300 observations each, both with a standard deviation of 1. The first will have a mean of 2 and the second a mean of 1. Since the two sets of 300 observations are drawn independently, we expect the two vectors to be (approximately) uncorrelated.
np.random.seed(1234)
a1 = np.random.normal(2, 1, 300)
a2 = np.random.normal(1, 1, 300)
A = np.array([a1, a2]).T
A.shape
Note 1: We transpose the data with .T because the original shape is (2, 300) and we want the number of observations as rows (so with shape (300, 2)).
Note 2: We use the np.random.seed function for reproducibility; the same random numbers will be generated each time we run the cell. Let's check what the data looks like:
A[:10,:]
array([[ 2.47143516, 1.52704645],
[ 0.80902431, 1.7111124 ],
[ 3.43270697, 0.78245452],
[ 1.6873481 , 3.63779121],
[ 1.27941127, -0.74213763],
[ 2.88716294, 0.90556519],
[ 2.85958841, 2.43118375],
[ 1.3634765 , 1.59275845],
[ 2.01569637, 1.1702969 ],
[-0.24268495, -0.75170595]])
Nice, we have two column vectors. Now we can check that the distributions look normal:
sns.distplot(A[:,0], color="#53BB04")
sns.distplot(A[:,1], color="#0A98BE")
plt.show()
plt.close()
We can see that the distributions have equivalent standard deviations but different means (1 and 2). So that’s exactly what we have asked for.
Now we can plot our dataset and its covariance matrix with our function:
plotDataAndCov(A)
plt.show()
plt.close()
Covariance matrix:
[[ 0.95171641 -0.0447816 ]
[-0.0447816 0.87959853]]
We can see on the scatterplot that the two dimensions are uncorrelated. Note that one dimension has a mean of 1 (y-axis) and the other a mean of 2 (x-axis).
Also, the covariance matrix shows that the variance of each variable is close to 1 (the squared standard deviation we specified) and the covariance of columns 1 and 2 is very small (around 0). Since we drew the two vectors independently, this is coherent. The opposite is not necessarily true: a covariance of 0 doesn't guarantee independence.
Correlated data
Now, let’s construct dependent data by specifying one column from the other one.
np.random.seed(1234)
b1 = np.random.normal(3, 1, 300)
b2 = b1 + np.random.normal(7, 1, 300)/2.
B = np.array([b1, b2]).T
plotDataAndCov(B)
plt.show()
plt.close()
Covariance matrix:
[[ 0.95171641 0.92932561]
[ 0.92932561 1.12683445]]
The correlation between the two dimensions is visible on the scatter plot. We can see that a line could be drawn and used to predict y from x and vice versa. The covariance matrix is not diagonal (there are non-zero cells outside of the diagonal). That means that the covariance between dimensions is non-zero.
From this point with covariance matrices, you can research further into the following:
Mean normalization
Standardization or normalization
Whitening
Zero-centering
Decorrelate
Rescaling

How to calculate mean and standard deviation for hue values from 0 to 360?

Suppose 5 samples of hue are taken using a simple HSV model for color, having values 355, 5, 5, 5, 5, all a hue of red and "next" to each other as far as perception is concerned. But the simple average is 75, which is far away from 0 or 360 and close to a yellow-green.
What is a better way to calculate this mean and associated std?
The simple solution is to convert those angles to a set of vectors, from polar coordinates into cartesian coordinates.
Since you are working with colors, think of this as a conversion into the (a*, b*) plane. Then take the mean of those coordinates, and revert back into polar form again. In MATLAB:
theta = [355,5,5,5,5];
x = cosd(theta); % cosine in terms of degrees
y = sind(theta); % sine with a degree argument
Now, take the mean of x and y, compute the angle, then convert back from radians to degrees:
meanangle = atan2(mean(y),mean(x))*180/pi
meanangle =
3.0049
Of course, this solution is valid only for the mean angle. As you can see, it yields a consistent result with the mean of the angles directly, where I recognize that 355 degrees really wraps to -5 degrees.
mean([-5 5 5 5 5])
ans =
3
To compute the standard deviation, it is simplest to do it as
std([-5 5 5 5 5])
ans =
4.4721
Yes, that requires me to do the wrap explicitly.
I think the method proposed by user85109 is a good way to compute the mean, but not the standard deviation:
Imagine you have three angles: 180, 180 and 181.
The mean would be correctly computed, as a number approximately equal to 180,
but from [180, 180, -179] you would compute a high variance when in fact it is near zero.
At first glance, I would compute separately the means and variances for the positive half of the angles, [0, 180], and for the negative ones, [-180, 0], and later compute the combined variance:
https://www.emathzone.com/tutorials/basic-statistics/combined-variance.html
taking into account that the global mean, and the difference between it and the local means, has to be computed in both directions (clockwise and counterclockwise), and the correct one has to be chosen.
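For a wrap-free alternative, the standard circular-statistics definitions can be computed directly. A small Python sketch using the hue values from the question (scipy.stats also provides circmean and circstd):

import numpy as np

def circular_mean_std(angles_deg):
    """Circular mean and standard deviation of angles given in degrees."""
    a = np.deg2rad(np.asarray(angles_deg, float))
    s, c = np.mean(np.sin(a)), np.mean(np.cos(a))
    mean = np.rad2deg(np.arctan2(s, c)) % 360   # mean direction of the unit vectors
    R = np.hypot(s, c)                          # mean resultant length
    std = np.rad2deg(np.sqrt(-2 * np.log(R)))   # circular standard deviation
    return mean, std

print(circular_mean_std([355, 5, 5, 5, 5]))   # mean ≈ 3 degrees, std ≈ 4 degrees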
