Statistical mean centering - Using total mean or attribute mean

I have a set of data with over 1,000 rows and 20 attributes (shown in columns).
I want to use mean centering, which involves subtracting the mean from each value so that the result has a mean of 0. Should I remove the mean on an attribute-by-attribute basis, or remove the mean of all attributes from every value?
For example, suppose the mean of attribute A is 500 and the mean of attribute B is 1,000.
I could subtract 500 from all values in A, which gives attribute A a mean of 0, and then do the same for attribute B (subtracting 1,000).
OR
I could subtract 750 (the overall mean) from all values in both attributes.
Which is more statistically correct?
My question is due to this:
If I subtract different values from the different attributes, the attributes are no longer comparable, as a different amount has been taken from each. If I subtract the same value from all of them, then some columns may end up full of negative figures (which seems to negate the effect of mean centering).
Thanks,

Typically you would center each attribute individually.
If you center each attribute separately, you are assuming that what matters for an individual is how each measure differs from the mean of that attribute, and you give up absolute comparisons between attributes for that individual.
For instance, if you had people's heights and weights, centering them separately would let you ask "for a person who is taller than average, is their weight also above the average weight?". Averaging height and weight together would be meaningless.
One way to think about it is that you are creating an average individual, which you can then use as a benchmark against all your observations.
Now, if the absolute values of two measures are comparable, say product price and cost, you would no longer be able to compare them, because each would be shifted by a different amount. If what you care about is a measure that uses absolute comparisons within an individual observation, you would need to create an auxiliary metric, for instance % profit. In that case, the centered values would allow you to ask "are products with higher-than-average prices also more profitable than average?".
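A minimal MATLAB sketch of per-attribute (column-wise) centering; the matrix name X and the placeholder data are assumptions for illustration, not from the question:
X = rand(1000, 20);                 % placeholder data: rows = observations, columns = attributes
colMeans = mean(X, 1);              % mean of each attribute (column)
Xcentered = X - colMeans;           % subtract each column's own mean (implicit expansion, R2016b+)
                                    % on older releases: Xcentered = bsxfun(@minus, X, colMeans);
% every column of Xcentered now has a mean of (approximately) 0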

Related

How to quantify magnitude of change in dataset containing base values of 0?

I have a dataset with the low-water and high-water surface area of lakes/ponds within a delta for each year. These lakes can undergo substantial change from year to year and can sometimes dry out completely, so surface area can be 0 during the low-water period. I'm trying to quantify the magnitude of spring flooding on the surface areas of these lakes. Given the high interannual variation in surface area, I need to compare the low-water value from the previous year to the high-water value of the following year to quantify this magnitude; comparing to a mean isn't sensitive enough. However, given the low-water surface area of 0 for some lakes, I cannot quantify percent change.
My current idea is to do an "inverse" of percent change (I don't know how else to describe it), where I divide the low-water value by the high-water value. This gives me a scale where large change equals 0 and little change equals 1. However, small changes from a surface area of 0 will again be over-represented. Any idea how I could accurately compare the magnitude of flooding in such a case?
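For reference, a minimal MATLAB sketch of the ratio described above; the variable names and numbers are made up for illustration:
lowArea = [0 12 35];                % previous year's low-water areas (placeholder values)
highArea = [40 15 38];              % following year's high-water areas (placeholder values)
changeRatio = lowArea ./ highArea;  % near 0 = large change (e.g. refilled from dry), near 1 = little change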

Correlation statistics

Naive Question:
In the attached snapshot, I am trying to understand how correlation behaves when applied to the actual values versus to a calculation performed on those values that creates a new stream of data.
In the example, columns A, B, C, D, E have very different correlations, but when I take a rolling sum of the same columns to get G, H, I, J, K, the correlations are all very similar (strongly negative or strongly positive).
Are these two different types of correlation, or am I missing something?
Thanks in advance!!
Yes, these are different correlations. It's similar to measuring the acceleration of 5 automobiles over time (your first set of data) and correlating those accelerations: each car accelerates at a different rate at each point in time, leaving your correlations all over the place.
Your second set of data is analogous to each car's velocity at each point in time, i.e. the accumulated acceleration. Because each car accelerates at a fairly constant rate (and the cars head in two different directions from the starting point), you get either a large positive or a large negative correlation.
Getting such a large positive or negative correlation in the second set is not guaranteed, but since each of your summed columns is consistently positive or negative and grows at a fairly steady rate, it correlates strongly, one way or the other, with the other summed columns.
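A small MATLAB illustration of this effect with made-up trending series; all names and numbers here are assumptions, not your spreadsheet columns:
rng(1);                            % reproducible example
t = (1:100)';
A = 0.5*t + 20*randn(100, 1);      % noisy series with a mild upward trend
B = -0.5*t + 20*randn(100, 1);     % noisy series with a mild downward trend
corrcoef(A, B)                     % modest correlation: the noise dominates
G = movsum(A, 10);                 % 10-point rolling sums, like your columns G-K
H = movsum(B, 10);
corrcoef(G, H)                     % summing smooths the noise, so the trends dominate and the
                                   % correlation moves towards -1 (or +1 for same-direction trends)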

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking of something like using a histogram to bin the values, but I am not sure what code to start with to accomplish this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern: the distribution of x-values is not uniform. There are many more values close to t=0, while at t=5 (for example) the data points are much less frequent.
Another concern: what happens if two values fall within one bin? I assume I would need to average those values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly sampled in time, interpolating onto a common time grid before computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that interpolation will likely reduce the range of your values, since the interpolated points are unlikely to fall exactly at the times of your measured points; this has a greater effect on the extreme statistics (e.g. the 5th and 95th percentiles) than on the mean. If you plan on going this route, you'll need the interp1 and mean functions, as in the sketch below.
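A sketch of that interpolation approach, assuming each curve i is stored as a time vector T{i} and a value vector Y{i} in cell arrays, and that time runs roughly from 0 to 5 (these names and ranges are assumptions):
tCommon = linspace(0, 5, 100);               % common time grid to resample onto
nCurves = numel(T);
Yi = nan(nCurves, numel(tCommon));
for i = 1:nCurves
    Yi(i, :) = interp1(T{i}, Y{i}, tCommon); % resample curve i onto the common grid;
end                                          % times outside a curve's range become NaN
meanCurve = mean(Yi, 1, 'omitnan');          % average across curves at each common time
stdCurve = std(Yi, 0, 1, 'omitnan');         % spread across curves at each common time
plot(tCommon, meanCurve, tCommon, meanCurve + stdCurve, '--', tCommon, meanCurve - stdCurve, '--');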
An alternative is to compute a time-weighted mean; this way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time, you can compute the weighted mean by:
timeStep = diff(t);                                           % duration represented by each sample
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep);   % weight each value by its duration
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".

How to calculate the percentage of total area of features having specific attribute values with QGIS?

I'm working in QGIS with different layers covering the same geographical extent. By taking the intersection of those layers, I generated a new layer whose attribute table contains all the attributes from the different layers. I would like to know if there is a tool in QGIS that would allow me to calculate the percentage of the total area covered by features with specific attribute values. Would it be possible, for example, to compare the percentage area of features characterised by value A and value B of attributes 1 and 2 with that of features characterised by values C and D of the same attributes?
Thank you very much for your help.
Regards,

Bootstrapping with Replacement

I'm reading a paper and am confused by the bootstrap method it describes. The text says:
the uncertainties associated with each stacked flux density are obtained via the bootstrap method, during which random subsamples (with replacement) of sources are chosen and re-stacked. The number of sources in each subsample is equal to the original number of sources in the stack. This process is repeated 10000 times in order to determine the representative spread in the properties of the population being stacked.
So, say I have 50 values and I find their average. According to this method, I would take a subsample from this original population of 50, find its average, and repeat this 10,000 times. Now, how would I get a subsample "equal to the original number of sources in the stack" without my subsample being exactly the same as the original, and thus having exactly the same mean, which would tell us nothing?
You can reuse values. So if I have A, B, C, D, E as my values, I can bootstrap with A, A, B, C, D, and so on. I can use a value more than once; that is the key.
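A minimal MATLAB sketch of that procedure; the variable names and placeholder data are assumptions, so substitute your 50 measured values for x:
x = randn(50, 1);                        % placeholder for the 50 original values
nBoot = 10000;                           % number of bootstrap resamples, as in the paper
bootMeans = zeros(nBoot, 1);
for b = 1:nBoot
    idx = randi(numel(x), numel(x), 1);  % draw 50 indices WITH replacement: repeats allowed
    bootMeans(b) = mean(x(idx));         % mean of this resample; differs from run to run
end
bootUncertainty = std(bootMeans);        % spread of the resampled means = the quoted uncertainty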
