Custom number format in Excel for disk or memory sizes

I'm trying to draw a graph where the y-axis is disk sizes.
And I have sizes ranging from 2 kilobytes through about 22 petabytes.
Represented as numbers, that is ~2000 to 22e12.
This looks pretty bad on a chart axis.
I could set the scale to "thousands", but then I'd be left with numbers between 2 and 22e9, and the reader is left to do the math that 22e9 (thousand) bytes is 22 petabytes, and so on.
But that's not intuitive.
So I tried a custom format.
I know that I can do
[Red][>1000000000];[Blue][>1000000]
but only two can be provided in this way.
I also know that I can provide separate format sections for positive, negative and zero values.
But is there a way in which I can accomplish the following:
(a) cell values are numbers, sizes in bytes, kilobytes or some such unit
(b) graph shows y axis with these numbers
(c) y-axis is logarithmic (very important)
(d) the y-axis labels are converted to K, M, G or P bytes as appropriate
If you think you have a solution, please verify it with this sample data:
1990, 2050
1992, 21246
1993, 208557
1996, 20971520
2000, 306184192
2012, 1.75922E+14
Your graph should be an X-Y Scatter (with lines)
Your graph should include the numbers in the first column as the x-axis on a linear scale
Your graph should include the numbers in the second column as the y-axis on a logarithmic scale
Your graph should have y-axis legends like "1K", "10K", "100K", "1M", "10M", "100M", ... "1P" and so on at the appropriate points.
The same solution would obviously also apply to money, where you want to show thousands, millions or billions as a small number with the appropriate suffix.

Try this to convert a string value in the form 99.9G to the numeric value 99.9E9:
=CHOOSE(SEARCH(RIGHT(B5),"kMG"), 10^3,10^6,10^9)*VALUE(LEFT(B5,LEN(B5)-1))
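The reverse direction (raw byte count to a short K/M/G/T/P label, which is what the axis labels need) cannot be covered by a single custom format, since only two conditions are allowed, as noted in the question. Purely for reference, here is a rough sketch of both conversions in Python; these are hypothetical helpers, not an Excel solution:
def to_bytes(s):
    # Python equivalent of the formula above: "99.9G" -> 99.9e9 (SEARCH is case-insensitive, hence upper())
    return float(s[:-1]) * {"K": 1e3, "M": 1e6, "G": 1e9}[s[-1].upper()]
def to_label(n):
    # Reverse direction: raw byte count -> short axis-style label
    for suffix, factor in [("P", 1e15), ("T", 1e12), ("G", 1e9), ("M", 1e6), ("K", 1e3)]:
        if n >= factor:
            return f"{n / factor:g}{suffix}"
    return f"{n:g}"
for v in [2050, 21246, 208557, 20971520, 306184192, 1.75922e14]:   # the sample data above
    print(to_label(v))   # 2.05K, 21.246K, 208.557K, 20.9715M, 306.184M, 175.922T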

Related

Understanding Density Plots from Pandas DataFrames

I am trying to understand the distribution of my data for a particular column. It has close to 1 million records.
Here is the code that I have written to see the density plot.
df[ "ratio"].plot.kde(bw_method=0.1) # Plot continuous column
https://wellsr.com/python/python-pandas-density-plot-from-a-dataframe/
Here is the plot that I get:
I am not clear on what the x-axis and y-axis indicate.
Is the x-axis the ratio values from the dataframe?
What does Density mean on the y-axis, and how is it calculated?
Is there a formula to derive the values on the y-axis? I am mostly interested in deriving them: given the column ratio, how can we come up with the density values? Can someone quickly show the maths?
If you are plotting a KDE, you are plotting a probability density function (PDF) of a random variable.
The x-axis is the range of values of the variable you are plotting; in your case it represents the range of values of ratio.
The y-axis, on the other hand, is the kernel density estimate, i.e. the estimated probability density of the variable at that point (not a probability).
Read the documentation.
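To actually show the maths: pandas' plot.kde uses scipy's gaussian_kde, which puts a Gaussian bump of a fixed width on every observation and averages them. A minimal sketch of how one y-axis value is computed, assuming df is the DataFrame from the question:
import numpy as np
from scipy.stats import gaussian_kde
ratio = df["ratio"].dropna().to_numpy()
h = 0.1 * ratio.std(ddof=1)   # bandwidth for a scalar bw_method: factor * sample std (1-D case)
def density_at(x):
    # f(x) = (1/(n*h)) * sum_i exp(-0.5*((x - x_i)/h)**2) / sqrt(2*pi)
    return np.mean(np.exp(-0.5 * ((x - ratio) / h) ** 2)) / (h * np.sqrt(2 * np.pi))
x0 = ratio.mean()
print(density_at(x0))
print(gaussian_kde(ratio, bw_method=0.1)(x0)[0])   # should agree up to floating point
Note that the result is a density, not a probability, so individual values can exceed 1; probabilities come from integrating the curve over an interval, and the whole curve integrates to 1.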

Excel - Plot average of two plots with inconsistent time (X) axis

I have managed to plot two different data sets on the same axes; however, I'm also looking to plot another line showing their average.
The main problem is that both data sets have different X (time) values, so it's not possible to add an average column at the end and plot that. (See the highlighted row 22, for example; the corresponding Time values are different.)
Is there any way I can plot an average of two plots on the same axis?
One idea that might work is to place the values of both series one above the other in two new columns, sort this new data according to time, smooth it, then plot the smoothed combined data. Alternatively, you could do the smoothing by simply plotting the new sorted series, adding a moving average trendline to it, and then changing the formatting of the new series so that it is no longer visible (but the trendline is). Something like this:
In the above picture, series 3 is the plot of the sorted aggregate data of series 1 and 2. If you change the formatting of series 3 so that there is no line, you get something like this:
For my relatively small mock data sets, the results are admittedly poor (it was based on just 25 data points in each series), but if you have a large amount of closely spaced data, and you play around with the moving average window size, you might get something acceptable. If not, you should probably just interpolate both datasets to obtain two consistent time series.
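If you go the interpolation route, the idea is to re-sample both series onto the merged set of time points and average there. A rough pandas sketch, outside Excel; s1 and s2 are hypothetical Series holding the two data sets, each indexed by its own time values:
import pandas as pd
t = s1.index.union(s2.index)                     # merged, sorted time grid
a = s1.reindex(t).interpolate(method="index")    # linear interpolation against the time index
b = s2.reindex(t).interpolate(method="index")
avg = (a + b) / 2                                # a consistent series you can plot as the average
You may want to restrict t to the time range where both series have data before averaging.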

Computing average grid size

I am trying to compute the average cell size for the following set of points, as seen in the picture, which was generated using gnuplot:
gnuplot> plot "debug.dat" using 1:2
The points are almost aligned on a rectangular grid, but not quite: there seems to be a bias (jitter?) of, say, 10-15% along either X or Y. How would one efficiently compute a proper partition into tiles so that there is virtually only one point per tile, with the size expressed as (tilex, tiley)? I use the word virtually since the 10-15% bias may have moved a point into an adjacent tile.
Just for reference, I have manually sorted (hopefully correctly) and extracted the first 10 points:
-133920,33480
-132480,33476
-131044,33472
-129602,33467
-128162,33463
-139679,34576
-138239,34572
-136799,34568
-135359,34564
-133925,34562
Just for clarification, a valid tile as per the above description would be (1435,1060), but I am really looking for a quick automated way.
Let's do this for the X coordinate only:
1) Sort the X coordinates.
2) Look at the deltas between two subsequent X coordinates. These deltas will fall into two categories: either they correspond to spaces between two columns, or to spaces between crosses within the same column. Your goal is to find a threshold that separates the long spaces from the short ones. This can be done by finding the threshold that splits the deltas into two groups whose means are the furthest apart (I think).
3) Once you have the threshold, separate the points into columns. A column starts and ends at a delta that exceeds the threshold you measured previously.
4) Calculate the average position of each detected column.
5) Take the deltas between subsequent columns. Now, the problem is that you may get a stray point that would break your columns. Use a median to get the strays out.
6) You should have a robust estimate of your gridX (a code sketch of the whole procedure follows the worked example below).
Example, using your data and looking at the X axis:
-133920 -132480 -131044 -129602 -128162 -139679 -138239 -136799 -135359 -133925
After sorting, the deltas (shown in ascending order):
5 1434 1436 1440 1440 1440 1440 1440 1442
Here you can see that there is a very obvious threshold between the small (5) and large (1434 and up) deltas; 1434 will define your spacing here.
Split the points into columns:
-139679|-138239|-136799|-135359|-133925 -133920|-132480|-131044|-129602|-128162
1440 1440 1440 1434 5 1440 1436 1442 1440
Almost all points are alone, except the two -133925 -133920.
The average grid line positions are:
-139679 -138239 -136799 -135359 -133922.5 -132480 -131044 -129602 -128162
Sorted deltas:
1436.0 1436.5 1440.0 1440.0 1440.0 1440.0 1442.0 1442.5
Median:
1440
Which is the correct answer for your SMALL data set, IMHO.
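The whole procedure is short enough to sketch in code. A rough Python version, with the threshold simplified to the midpoint between the smallest and largest delta instead of the best two-group split:
import numpy as np
def grid_spacing(coords):
    # Sort, take deltas, split small/large gaps, average each column, take the median column spacing.
    xs = np.sort(np.asarray(coords, dtype=float))
    deltas = np.diff(xs)
    threshold = (deltas.min() + deltas.max()) / 2   # crude; a proper two-group split is more robust
    columns, current = [], [xs[0]]
    for x, d in zip(xs[1:], deltas):
        if d < threshold:
            current.append(x)                       # small gap: same column
        else:
            columns.append(np.mean(current))        # large gap: close the column, start a new one
            current = [x]
    columns.append(np.mean(current))
    return np.median(np.diff(columns))
xs = [-133920, -132480, -131044, -129602, -128162,
      -139679, -138239, -136799, -135359, -133925]
print(grid_spacing(xs))   # 1440.0 for the 10 sample points above
Running the same function on the Y coordinates gives the other tile dimension.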

Excel graphing and axis

I have a scatterplot with values that range from 2 to -2. The catch is that 1 is the "zero-point": in other words, the smallest positive value is 1.01 and the negative value closest to zero is -1.01. How can I edit the axis of the graph so that 0 is replaced with 1?
If you have a version of Excel later than 2007 I'd suggest splitting the positives and negatives into separate series and plotting one on a secondary axis (not that I know whether or not that would work!), but with 2007 I have not been able to place one vertical axis above the horizontal and another below. Instead the best I could manage was to use two separate charts:
by again splitting up the series, careful positioning and judicious use of a text box for 0.
At least this way you are not constrained by the outer limits.
Based on some very specific conditions you can print the zero-point as 1 using a custom number format: you have to set the axis options to be fixed at -2 (minimum) and 2 (maximum), with a major unit of 2 as well. This ensures that you only have the three values -2, 0 and 2 on the vertical/y-axis. Why is this important? Well, custom number formats can easily distinguish between positive, negative and zero values, which is exactly what you have with -2, 0 and 2.
Here's a visual of the input/output:
The custom number format is set to 2;-2;1, thereby formatting all positive numbers to 2, all negative numbers to -2 and zero to 1.
If all you want is to replace the axis label "0" with "1" (as in the answer by Werner), then you can use the following (similar to this):
Add X and Y values for a dummy series, with 3 points. If the minimum value in your X-axis is xm, your points are (xm, -2), (xm, 0), (xm, 2).
Add cells with the 3 labels that you will use for the dummy series: "-2", "1", "2".
Go to the chart, and remove the tick labels of the Y-axis.
Add a series with the 3 dummy data points.
Add the labels to the data points. You can use references to the cells of item 2, or enter explicit labels. Entering each label (either a reference or an explicit label) is tedious when you have many data points. Check this, and in particular Rob Bovey's add-in. It is excellent.
Format the dummy series so it is visually ok (e.g., small, hairline crosses, no line).
You can use variations on this. For instance, you can add extra points to your dummy series, with corresponding labels. Gridlines would match the dummy series.
But I think this is not appropriate, as the locations of your data points will be inconsistent with the scales.
What is appropriate is having an interrupted axis, where the interval (-1,1) is eliminated.
The answer by pnuts aims at that.
I propose something different, with the advantage of using only one chart:
Create a column where you add 2, only to negative Y-values. Use that column as your new Y-values.
Use the same trick as above, with your dummy series now being (xm, 0), (xm, 1), (xm, 2), and the labels the same as above.
You can use additional points in your dummy series.
You can use this technique to create an arbitrary number of axis interruptions. The formula for the "fake" Y-values would be more complicated, with IFs to detect the interval corresponding to each point, and suitable linear transformations to account for the change in scale for each interval (assuming linear scales; no mixing linear-log). But that is all.
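For what it's worth, the mapping behind the "fake" Y-values is tiny. A sketch of the single-interruption case (a hypothetical helper; in the worksheet it is just an IF in the extra column):
def squeeze_out(y, lo=-1.0, hi=1.0):
    # Plotting coordinate for y when the interval (lo, hi) is removed from the axis:
    # values at or below lo are shifted up by the width of the removed gap,
    # values at or above hi are left alone (assumes no data falls strictly inside the gap).
    return y + (hi - lo) if y <= lo else y
print(squeeze_out(-2.0), squeeze_out(-1.5), squeeze_out(1.5), squeeze_out(2.0))   # 0.0 0.5 1.5 2.0
For this question that is exactly "add 2, only to negative Y-values"; more interruptions just mean more branches and larger shifts.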
PS: see also the links below. I still think my alternative is better.
http://peltiertech.com/broken-y-axis-in-excel-chart/
http://ksrowell.com/blog-visualizing-data/2013/08/12/how-to-simulate-a-broken-axis-value-axis/
http://www.tushar-mehta.com/excel/newsgroups/broken_y_axis/tutorial/index.html#Rescale%20and%20hide%20the%20y-axis

How to use Excel column chart for datasets that have very different scales

There are 2 datasets that have values in the interval [0; 1]. I need to visualize these 2 datasets in Excel as a column chart. The problem is that some data points have values 0.0001, 0.0002, and other data points have values 0.8, 0.9, etc. So the difference is huge, and therefore it's impossible to see the data points with small values. What could be the solution? Should I use a logarithmic scale? I would appreciate any example.
Two possible ways below
Graph the smaller data set as a second series against a right hand Y axis (with same ratio from min to max as left hand series)
Multiply the smaller data set by 1000 and compare the multiplied data set to the larger one
Note that a log scale isn't really an option here: you are working with fractions between 0 and 1, whose logarithms are negative.
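For the second option above (multiplying the smaller data set), a quick way to pick the multiplier is the largest power of ten that keeps the scaled small values within the range of the large ones. Sketched in Python with the example values from the question (in Excel this is just MAX and a power of ten):
import math
small = [0.0001, 0.0002]
large = [0.8, 0.9]
factor = 10 ** math.floor(math.log10(max(large) / max(small)))   # 1000 here
scaled = [v * factor for v in small]                             # plot these next to large, label the series "x1000"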
