Scatter plot linear trend does not match data analysis toolpak

When I create a scatterplot of my data, and go to Add Trendline..., the trendline that I get is y = 0.5425x + 12.205
When I run the same data set through the Data Analysis Toolpak (Regression), I get a trendline of y = 1.65333x - 17.26667
Aren't these two things supposed to be the same, except perhaps for rounding? What are some common causes of this issue? I've already checked to make sure all of my data values are included in both.
Edit: here is the data set (y is the first column, x is the second; can't get this to format properly in stackoverflow)
y: 3, 4, 8, 7, 15, 25, 35, 45, 60, 80
x: 5, 10, 15, 20, 25, 30, 35, 40, 45, 50
Edit (update): I verified by hand, and the results of the Data Analysis Toolpak are correct; the trendline on the scatter plot is incorrect.

I found the source of the error: the Data Analysis Toolpak understood that I had the columns ordered (y, x) (i.e., y was in column A and x was in column B); however, the scatter plot did not. So the scatter plot was doing x vs y instead of y vs x.
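The mix-up can be reproduced outside Excel. The following is a minimal sketch in plain Python, not Excel's own routine; `least_squares` is a hypothetical helper written for this illustration. It fits the posted data once as y against x and once with the axes swapped.

```python
def least_squares(x, y):
    """Return (slope, intercept) of the ordinary least-squares line y = slope*x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Data from the question (y is the first column, x is the second).
y = [3, 4, 8, 7, 15, 25, 35, 45, 60, 80]
x = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

toolpak_fit = least_squares(x, y)  # y regressed on x, as the Toolpak did
swapped_fit = least_squares(y, x)  # axes swapped, as the chart did

print("y on x: slope=%.5f intercept=%.5f" % toolpak_fit)
print("x on y: slope=%.5f intercept=%.5f" % swapped_fit)
```

The first fit gives a slope of about 1.65333 and an intercept of about -17.26667, matching the Toolpak. The swapped fit gives a slope of about 0.5425 and an intercept close to 12.2, which is in line with the chart's trendline.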

Related

Python - Invert list order

I want to invert list order without changing the values.
The original list is the following:
[15, 15, 10, 8, 73, 1]
The expected result is:
[10, 8, 15, 15, 1, 73]
The example has been taken from a real data handling problem from a more complex pandas data frame.
I proposed a list problem only to simplify the issue. So, it can also be a pandas function.

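values = [15, 15, 10, 8, 73, 1]  # renamed from `list` to avoid shadowing the builtin
half = len(values) // 2
for i in range(half):
    # positions of the i-th largest and the i-th smallest value
    a = values.index(sorted(values, reverse=True)[i])
    b = values.index(sorted(values)[i])
    values[b], values[a] = values[a], values[b]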

Hue, colorbar, or scatterplot colors do not match in seaborn.scatterplot

Using an example from another post, I'm adding a color bar to a scatter plot. The idea is that both dot hue, and colorbar hue, should conform to the maximum and minimum possible, so that the colorbar can reflect the range of values in the hue:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb

x = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
y = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
z = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 255]
df = pd.DataFrame(list(zip(x, y, z)), columns=['x', 'y', 'z'])
colormap = matplotlib.cm.viridis
# A continuous color bar needs to be added independently
norm = plt.Normalize(df.z.min(), df.z.max())
sm = plt.cm.ScalarMappable(cmap=colormap, norm=norm)
sm.set_array([])
fig = plt.figure(figsize=(10, 8), dpi=300)
ax = fig.add_subplot(1, 1, 1)
sb.scatterplot(x="x", y="y",
               hue="z",
               hue_norm=(0, 255),
               data=df,
               palette=colormap,
               ax=ax
               )
ax.legend(bbox_to_anchor=(0, 1), loc=2, borderaxespad=0., title='hue from sb.scatterplot')
ax.figure.colorbar(sm).set_label('hue from sm')
plt.xlim(0, 255)
plt.ylim(0, 255)
plt.show()
Note how the hue from the scatterplot, even with hue_norm, ranges up to 300. In turn, the hue from the colorbar ranges from 0 to 255. From experimenting with values in hue_norm, it seems that matplotlib always rounds it off so that you have a "good" (even?) number of intervals.
My questions are:
Which one is showing an incorrect range: the scatterplot, the scatterplot legend, or the colorbar? And how can it be corrected?
How could you retrieve min and max hue from the scatterplot (in this case 0 and 300, respectively), in order to set them as maximum and minimum of the colorbar?

Do you really need to use seaborn's scatterplot()? Using a numerical hue is always quite messy.
The following code is much simpler and yields an unambiguous output:
fig, ax = plt.subplots()
g = ax.scatter(df['x'],df['y'], c=df['z'], cmap=colormap)
fig.colorbar(g)

Defining metrics when evaluating multiple values per sample

I have an application that executes a
function foo() {...}
several times for each user session. There are two alternative algorithms that I can implement as the "foo" function, and my goal is to evaluate them on execution delay using A/B testing.
The number of times foo() is called per user session is variable but will not exceed 10000.
Each value is between 1 and 400 milliseconds.
Say the delay values are:
Algo1: [ [12, 30, 20, 40, 280] , [13, 14, 15, 100, 10], [20, 40] , ... ]
Algo2: [ [1, 10, 5, 4, 150] , [14, 10, 20], [21, 33, 41, 79], ... ]
My question is: what is the best metric to pick the winner?
Possible options:
average from each session, and then evaluate the CDF
median from each session, and then evaluate the CDF
anything else?

One possibility which captures both mean performance and volatility (variability) is quadratic loss: $\ell = (Y - \tau)^2$, where the Y's are the individual outcomes and τ is a desired target value (in your case zero). Calculate the average loss across all observations for each of your algorithms, which estimates the expected loss $E[\ell]$, then pick the algorithm with the smallest average loss.
It's straightforward to show that under expectation $E[\ell] = (E[Y] - \tau)^2 + \sigma_Y^2$. In other words, quadratic loss has two components:
how far the expected value of the Y's is from your target τ; and
how variable the Y's are.
Low loss is achieved by being consistently close to the target. With a target of zero, this means you're getting values that on average are close to zero and aren't subject to large discrepancies. Either large means or large variances will inflate the loss, so minimum loss requires both aspects to perform well simultaneously.
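As a rough illustration of the procedure above, the sketch below computes the average quadratic loss with a target of zero. It uses only the sessions actually listed in the question (the elided "..." sessions are not filled in), and `average_quadratic_loss` is a hypothetical helper name. It also checks the mean-plus-variance decomposition numerically, using the population variance.

```python
from statistics import fmean, pvariance


def average_quadratic_loss(sessions, target=0.0):
    """Pool all observations across sessions and average (y - target)^2."""
    values = [v for session in sessions for v in session]
    return fmean([(v - target) ** 2 for v in values])


# Only the sessions shown in the question; the real data set is larger.
algo1 = [[12, 30, 20, 40, 280], [13, 14, 15, 100, 10], [20, 40]]
algo2 = [[1, 10, 5, 4, 150], [14, 10, 20], [21, 33, 41, 79]]

loss1 = average_quadratic_loss(algo1)
loss2 = average_quadratic_loss(algo2)

# Decomposition check for Algo1: (mean - target)^2 + population variance.
flat1 = [v for session in algo1 for v in session]
decomposed1 = fmean(flat1) ** 2 + pvariance(flat1)

winner = "Algo1" if loss1 < loss2 else "Algo2"
print(loss1, loss2, winner)
```

On these few sample values the average loss is lower for Algo2, but that says nothing about the full data set. Pooling all observations gives long sessions more weight than short ones; averaging per session first would be a different design choice.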

How to increase color resolution in python matplotlib 3D plots

(edited to make the code clearer) I am using Poly3DCollection to make a graph where several polygons in a 3D space have a colour that depends on a value contained in a separate array.
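import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d.art3d import Poly3DCollection

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
cmap = cm.plasma
quantity = [0.1, 0.11, 5, 10]
colours = cmap(quantity)
K = len(quantity)  # assumed: one polygon per entry in quantity
for i in range(K):
    x = [0, 1, 1, 0]
    y = [0, 0, 1, 1]
    z = [0, 1, 0, 1]
    verts = [list(zip(x, y, z))]
    ax.add_collection3d(Poly3DCollection(verts, color=colours[i]))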
The problem I have is that the resulting image has a very limited colour resolution, and most of the polygons have the same colour.
I understood from this post that it may be due to Python automatically using only 7 different colour levels, but unfortunately the solution in the post only applies to 2D plots.
Any idea on how to extend that to 3D plots?
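One possible contributor, offered as an assumption rather than a confirmed diagnosis: a Matplotlib colormap called with floats expects values in [0, 1], and floats outside that range are clipped to the end colours, so 5 and 10 would both get the top colour. A minimal sketch of rescaling the values first, using a hypothetical helper `min_max_normalize`:

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the bottom of the range.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


quantity = [0.1, 0.11, 5, 10]
scaled = min_max_normalize(quantity)
print(scaled)
```

The `scaled` list could then be passed to `cmap(...)` in place of `quantity`. `matplotlib.colors.Normalize` provides the same linear rescaling. Note that 0.1 and 0.11 remain almost indistinguishable after a linear rescale because they are so close relative to the full range.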

How to force Plot.ly Python to use a given yaxis range?

As you can see below, I manually define the range for each yaxis as well as setting the autorange option to be False.
However, if you graph this, you will still find the yaxis1 range is 0 to 20 rather than 0 to 25. As a result, one of the bars sticks out of the chart.
How do I make it so that I can be certain every value will be contained within the yaxis range?
Edit: Additionally, the top grid line in the second row is not showing. If I rescale slightly, it will appear again. So the issue seems to be purely graphical. Any ideas are appreciated.
from plotly import tools
fig = tools.make_subplots(rows=2, cols=2, subplot_titles=['A', 'B'], shared_xaxes=False, shared_yaxes=True)
data = [[10, 4, 15, 20.5], [3, 12, 22.2], [6.5, 12, 26.2], [18, 4.2, 22.2]]
traces = [go.Bar(x=['Type A', 'Type B', 'Type C'], y=d) for d in data]
fig.append_trace(traces[0], 1, 1)
fig.append_trace(traces[1], 1, 2)
fig.append_trace(traces[2], 2, 1)
fig.append_trace(traces[3], 2, 2)
fig['layout']['yaxis1'].update(title='', range=[0, 25], autorange=False)
fig['layout']['yaxis2'].update(title='', range=[0, 30], autorange=False)
py.iplot(fig)

I tried your code and was able to replicate the issue.
Reason:
The top-left graph's y-axis shows three tick values, [0, 10, 20], so the ticks are spaced 10 apart. When you set the range to [0, 25], the upper bound does not fall on a multiple of 10, so 25 is not shown on the y-axis.
On the bottom-left graph's y-axis, the value 30 is a multiple of the tick spacing of 10, so 30 does appear on the y-axis.
Solution:
P.S.: A personal thank-you to Maximilian Peters for helping to find the solution!
According to the plotly documentation, the yaxis object has a property called dtick that sets the increment between ticks. Plotly defines it as:
dtick (number or categorical coordinate string) Sets the step
in-between ticks on this axis. Use with tick0. Must be a positive
number, or special strings available to "log" and "date" axes. If the
axis type is "log", then ticks are set every 10^(n*dtick) where n is
the tick number. For example, to set a tick mark at 1, 10, 100, 1000,
... set dtick to 1. To set tick marks at 1, 100, 10000, ... set dtick
to 2. To set tick marks at 1, 5, 25, 125, 625, 3125, ... set dtick to
log_10(5), or 0.69897000433. "log" has several special values; "L<f>",
where f is a positive number, gives ticks linearly spaced in value
(but not position). For example tick0 = 0.1, dtick = "L0.5" will
put ticks at 0.1, 0.6, 1.1, 1.6 etc. To show powers of 10 plus small
digits between, use "D1" (all digits) or "D2" (only 2 and 5). tick0
is ignored for "D1" and "D2". If the axis type is "date", then you
must convert the time to milliseconds. For example, to set the
interval between ticks to one day, set dtick to 86400000.0. "date"
also has special values "M<n>" gives ticks spaced by a number of
months. n must be a positive integer. To set ticks on the 15th of
every third month, set tick0 to "2000-01-15" and dtick to "M3". To
set ticks every 4 years, set dtick to "M48"
So, when we set dtick to 5 and the range to [0, 25], we get the expected result.
Please try out the code below and let me know whether your issue is resolved.
import pandas as pd
import plotly.offline as py_offline
import plotly.graph_objs as go
py_offline.init_notebook_mode()
from plotly import tools
fig = tools.make_subplots(rows=2, cols=2, subplot_titles=['A', 'B'], shared_xaxes=False, shared_yaxes=True)
data = [[10, 4, 15, 20.5], [3, 12, 22.2], [6.5, 12, 26.2], [18, 4.2, 22.2]]
traces = [go.Bar(x=['Type A', 'Type B', 'Type C'], y=d) for d in data]
fig.append_trace(traces[0], 1, 1)
fig.append_trace(traces[1], 1, 2)
fig.append_trace(traces[2], 2, 1)
fig.append_trace(traces[3], 2, 2)
fig['layout']['yaxis1'].update(title='', range=[0, 25], dtick=5, autorange=False)
fig['layout']['yaxis2'].update(title='', range=[0, 30], autorange=False)
py_offline.iplot(fig)
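The tick-spacing reasoning can be checked with a small sketch. This is not Plotly's actual tick-placement algorithm; `linear_ticks` is a hypothetical helper that simply lists the multiples of dtick, counted from tick0, that fall inside the axis range.

```python
import math


def linear_ticks(tick0, dtick, lo, hi):
    """List tick positions tick0 + n*dtick that lie within [lo, hi]."""
    first = math.ceil((lo - tick0) / dtick)
    last = math.floor((hi - tick0) / dtick)
    return [tick0 + n * dtick for n in range(first, last + 1)]


ticks_step10 = linear_ticks(0, 10, 0, 25)    # spacing of 10 on a [0, 25] range
ticks_step5 = linear_ticks(0, 5, 0, 25)      # dtick=5 on the same range
ticks_30_range = linear_ticks(0, 10, 0, 30)  # spacing of 10 on a [0, 30] range

print(ticks_step10)    # [0, 10, 20]
print(ticks_step5)     # [0, 5, 10, 15, 20, 25]
print(ticks_30_range)  # [0, 10, 20, 30]
```

With a spacing of 10 the top of a [0, 25] range never lands on a tick, whereas with a spacing of 5 it does, which matches the behaviour described above.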
