Pearson correlation in Python with data.corr()

I have a matrix of shape (20, 17), where the rows are time steps and the columns are variables.
When I compute the correlation matrix using data.corr(), I naturally get a (17, 17) matrix.
My questions:
Is there a way to normalise the variables directly within the .corr() function? (I know I can do that beforehand and then apply the function.)
My correlation matrix is large and I have trouble viewing everything in one go (I have to scroll down to make the necessary comparisons). Is there a way to present the results concisely (like a heat map) so that I can easily tell the highest correlations from the lowest?
Many thanks

You could use matplotlib's imshow() to see a heatmap of any matrix.
Also, consider using pandas DataFrames; this way you can sort by correlation strength and keep the labels of each row and column.
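As a rough sketch of both suggestions, the snippet below builds a labelled heat map with imshow() and ranks the variable pairs by absolute correlation. The random data and the column names var0 to var16 are stand-ins for the real (20, 17) matrix, and the colour map is only one possible choice. It also checks that standardising the columns first leaves the result unchanged, since the Pearson coefficient is unaffected by shifting or rescaling a variable.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data with the shape from the question: 20 time steps, 17 variables.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(20, 17)),
                    columns=[f"var{i}" for i in range(17)])

corr = data.corr()  # Pearson by default; result is 17 x 17

# Standardising each column beforehand gives the same correlation matrix.
standardised = (data - data.mean()) / data.std()
same_after_scaling = np.allclose(corr, standardised.corr())

# Heat map of the correlation matrix with labelled axes.
fig, ax = plt.subplots(figsize=(8, 7))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label="Pearson r")

# Keep each pair of variables once (strictly above the diagonal),
# then sort the pairs by absolute correlation, strongest first.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked.head(5))
```

Call plt.show() or fig.savefig(...) afterwards to view the figure; ranked.head(5) and ranked.tail(5) list the strongest and weakest pairs with their labels.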


Normalisation or Standardisation for detecting outliers?

When should I use min-max scaling (normalisation) and when should I use standardisation (z-scores) for data pre-processing?
I know that normalisation brings the range of a feature down to 0 to 1, and that the z-score brings it down to about -3 to 3, but I am unsure which of the two techniques to use for detecting outliers in data.
Let us briefly agree on the terms:
The z-score tells us how many standard deviations a given element of a sample is away from the mean.
Min-max scaling is a method of rescaling a range of measurements to the interval [0, 1].
By those definitions, the z-score can span an interval much larger than [-3, 3] if your data follows a long-tailed distribution. On the other hand, a plain normalization does indeed limit the range of the possible outcomes, but will not help you to find outliers, since it just bounds the data.
What you need for outlier detection are thresholds above or below which you consider a data point to be an outlier. Many programming languages offer violin plots or box plots, which nicely show your data distribution. The methods behind these plots implement a common choice of thresholds:
Box and whisker [of the box plot] plots quartiles, and the band inside the box is always the second quartile (the median). But the ends of the whiskers can represent several possible alternative values, among them:
the minimum and maximum of all of the data [...]
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.
All data points outside the whiskers of the box plots are plotted as points and considered outliers.
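To make the contrast concrete, here is a minimal sketch using only Python's standard library. The sample values and the cut-off of 2.5 are assumptions chosen for illustration; with only ten points and the sample standard deviation, no z-score can reach 3, so a lower threshold is used here.

```python
import statistics

def z_scores(values):
    """Number of sample standard deviations each value lies from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(v - mean) / stdev for v in values]

def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

sample = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]  # made-up data, one obvious outlier
Z_THRESHOLD = 2.5  # assumed cut-off, not a universal rule

# A threshold on the z-score singles out points far from the mean.
outliers = [v for v, z in zip(sample, z_scores(sample)) if abs(z) > Z_THRESHOLD]

# Min-max scaling only squeezes everything into [0, 1]; it flags nothing by itself.
scaled = min_max(sample)

print(outliers)
print(min(scaled), max(scaled))
```

The min-max output always spans exactly 0 to 1 whatever the data looks like, which is why it cannot serve as an outlier test without an additional threshold.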

Excel Interpolate with logarithmic prediction

Is there a function within Excel to Interpolate while taking into account a logarithmic prediction?
At the moment I am using linear interpolation but would like to find a better way to fill in the blanks if possible.
Excel has no dedicated function for logarithmic interpolation, even in the Analysis ToolPak. Dedicated numerical software such as MATLAB handles this more directly.
If you're stuck working in Excel... here's a possible mathematical solution:
Rather than working with the raw data x and y, try plotting x against a^y, where a is the base of the logarithm. (Or plot log(x,a) against y.) If you have the correct base a (and there's no vertical offset), you will then have a linear relationship on which you can perform linear interpolation as normal. Then convert the interpolated values back to actual values by taking the base-a log of them.
If you don't know what a is, you can instead calculate a line of best fit for an arbitrary a, calculate the residuals, and then use Solver to modify a until the residuals are as small as possible, at which point you have the best estimate of a.
Similarly, if there is a vertical offset b, you'll need to try values of b until you also get a linear relationship: plot x against a^(y-b).
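The "log(x,a) against y" variant can be sketched in a few lines of Python. This assumes the data really follows y = c*log(x) + b between the two neighbouring known points; in that case the base and the offset drop out, because y is linear in log(x) for any base. The function name log_interpolate and the test curve are hypothetical.

```python
import math

def log_interpolate(x, x0, y0, x1, y1):
    """Interpolate y at x, assuming y is linear in log(x) between the two known points."""
    fraction = (math.log(x) - math.log(x0)) / (math.log(x1) - math.log(x0))
    return y0 + fraction * (y1 - y0)

def curve(x):
    """Hypothetical ground truth used for the check: y = 2*ln(x) + 1."""
    return 2 * math.log(x) + 1

# Known points at x = 1 and x = 100; fill in the blank at x = 10.
estimate = log_interpolate(10, 1, curve(1), 100, curve(100))

# Plain linear interpolation between the same two points, for comparison.
linear = curve(1) + (10 - 1) / (100 - 1) * (curve(100) - curve(1))

print(estimate, linear, curve(10))
```

In a spreadsheet the same formula can be typed with LN() in place of math.log.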

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern is: The distribution amongst the x-values is not uniform. There are many more values closer to t=0, but at t=5 (for example), the frequency of data points is much less.
Another concern. What happens if two values fall within 1 bin? I assume I would need the average of these values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly sampled, interpolating onto a common time grid before computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that interpolation will likely reduce the range of your values, because the interpolated points are unlikely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. the 5th and 95th percentiles) than on the mean. If you plan on going this route, you'll need the interp1 and mean functions.
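For readers working in Python rather than MATLAB, the interpolation route might look like the sketch below, with numpy.interp doing the linear interpolation. The function name average_curves, the two example curves and the grid are all made up for illustration. Note that numpy.interp holds the end values constant outside a curve's time range, so the grid should stay within the range shared by all curves.

```python
import numpy as np

def average_curves(curves, grid):
    """Interpolate each (t, y) curve onto `grid`, then average across curves.

    Returns the mean and the sample standard deviation at every grid point.
    """
    resampled = np.array([np.interp(grid, t, y) for t, y in curves])
    return resampled.mean(axis=0), resampled.std(axis=0, ddof=1)

# Two made-up curves sampled at different, non-uniform times.
t1 = np.array([0.0, 0.5, 1.5, 4.0])
t2 = np.array([0.0, 2.0, 2.5, 4.0])
curves = [(t1, t1), (t2, t2 + 2.0)]  # y = t and y = t + 2

grid = np.array([1.0, 2.0, 3.0])  # common time grid inside the shared range
mean_curve, std_curve = average_curves(curves, grid)

print(mean_curve)
print(std_curve)
```

Plotting mean_curve with mean_curve ± std_curve around it gives the averaged curve with its standard deviation band.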
An alternative is to do a weighted mean. This way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time then you can compute the weighted mean by:
timeStep = diff(t);  % duration of the interval following each sample
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep);  % time-weighted mean (scalar)
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".

Excel Trend line (SLOPE() ) and CORREL() yields different coefficients

I'm trying to use Excel to get the coefficient for two financial market spreads using two methods on data series Sprd1 against data series Sprd2:
1) I used scatter plot and simply added a trend line, showing R^2 (0.4052) and Coefficient (0.614). Trend line should be using SLOPE() to get the coefficient...
2) I used =CORREL(Sprd1, Sprd2), showing 0.637; =RSQ(Sprd1, Sprd2), yielding 0.4052.
I understand that the R-sq values should be pretty close. But why would the coefficients differ? I'm looking for any difference in Excel's embedded methods or assumptions between the trendline and CORREL.
Thank you very much!
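While both RSQ and CORREL work from the same equation,
r = SUM((x-MEAN(x))*(y-MEAN(y))) / SQRT(SUM((x-MEAN(x))^2) * SUM((y-MEAN(y))^2))
the value returned by RSQ is the square of that result,
i.e. RSQ() = CORREL()^2.
SLOPE, on the other hand, does not use (y-MEAN(y))^2, nor does it take a square root of the denominator:
b = SUM((x-MEAN(x))*(y-MEAN(y))) / SUM((x-MEAN(x))^2)
so it will give a different result, depending on the spread of y relative to x: the two coincide only when x and y have the same standard deviation.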
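The relationship between the three numbers can be checked with a short Python sketch built on the standard formulas for the Pearson coefficient and the least-squares slope. The data is made up, and the helper name centred_sums is hypothetical; the point is that the slope equals the correlation multiplied by the ratio of the standard deviations of y and x.

```python
import math

def centred_sums(x, y):
    """Return (Sxy, Sxx, Syy), the mean-centred sums shared by the three formulas."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy, sxx, syy

x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 1, 4, 3, 6]

sxy, sxx, syy = centred_sums(x, y)
correl = sxy / math.sqrt(sxx * syy)  # Pearson r, as returned by CORREL
rsq = correl ** 2                    # as returned by RSQ
slope = sxy / sxx                    # least-squares slope of y on x, as returned by SLOPE

print(correl, rsq, slope)
print(correl * math.sqrt(syy / sxx))  # same value as slope
```

With the figures from the question, 0.637 squared is about 0.406, which matches the reported R^2 of 0.4052 up to rounding, while the slope of 0.614 differs from 0.637 because the two spreads do not have identical standard deviations.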

How to use Excel column chart for datasets that have very different scales

There are 2 datasets that have values in the interval [0; 1]. I need to visualize these 2 datasets in Excel as a column chart. The problem is that some data points have values such as 0.0001 or 0.0002, while other data points have values such as 0.8 or 0.9. The difference is huge, and therefore it's impossible to see the data points with small values. What could be the solution? Should I use a logarithmic scale? I would appreciate any example.
Two possible ways:
Graph the smaller data set as a second series against a right-hand Y axis (with the same ratio from min to max as the left-hand series).
Multiply the smaller data set by 1000 and compare the multiplied data set to the larger one.
Note that the logarithm of a fraction between 0 and 1 is negative, so a log scale isn't really an option.
