How to interpret coefficients in a log-log market mix model

I am running a multivariate OLS regression as below using weekly sales and media data. I would like to understand how to calculate the sales contribution when doing log transforms like log-linear, linear-log and log-log.
For example:
Volume_Sales = b0 + b1.TV_GRP + b2.SocialMedia + b3.PaidSearch + e
In this case, the sales contributed by TV is b1 x TV_GRP (the coefficient multiplied by that week's TV GRPs).
Now, my question is: How do we calculate sales contribution for the below cases:
Log-Linear: ln(Volume_Sales) = b0 + b1.TV_GRP + b2.SocialMedia + b3.PaidSearch + e
Linear-Log: Volume_Sales = b0 + b1.TV_GRP + b2.ln(SocialMedia) + b3.ln(PaidSearch) + e
Log-Log: ln(Volume_Sales) = b0 + b1.TV_GRP + b2.ln(SocialMedia) + b3.ln(PaidSearch) + e

In general terms, a log transformation takes something that acts on the multiplicative scale and re-represents it on the additive scale so that certain mathematical assumptions hold, linearity among them. So to step beyond the "transform data we don't like" paradigm that many of us are guilty of, I like thinking in terms of "does it make more sense for an effect on this variable to be additive (+3 units) or multiplicative (3 times as much, a 20% reduction, etc.)?" That and your diagnostic plots (residual, Q-Q, etc.) will do a good job of telling you what's most appropriate in your case.
As for interpreting coefficients, here are some ways I've seen it done.
Linear: y = b0 + b1x + e
Interpretation: there is an estimated b1-unit increase in the mean of y for every 1-unit increase in x.
Log-linear: ln(y) = b0 + b1x + e
Interpretation: there is an estimated change in the median of y by a factor of exp(b1) for every 1-unit increase in x.
Linear-log: y = b0 + b1ln(x) + e
Interpretation: there is an estimated b1*ln(2)-unit increase in the mean of y when x is doubled.
Log-log: ln(y) = b0 + b1ln(x) + e
Interpretation: there is an estimated change in the median of y by a factor of 2^b1 when x is doubled.
Note: these can be fairly readily derived by considering what happens to y if you replace x with (x+1) or with 2x.
These generic-form interpretations tend to make more sense with a bit of context, particularly once you know the sign of the coefficient. Say you've got a log-linear model with an estimated b1 of -0.3. Exponentiated, this is exp(-0.3)=0.74, meaning that there is an estimated change in the median of y by a factor of 0.74 for every 1-unit increase in x ... or better yet, a 26% decrease.
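To see this mechanically, here is a minimal simulation sketch in Python (my addition, not part of the original answer; it assumes numpy and statsmodels, and the data are made up) showing that exponentiating b1 from a log-linear fit recovers the multiplicative effect:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=5000)
# simulate log-linear data: each 1-unit increase in x scales the median of y by exp(-0.3)
y = np.exp(1.0 - 0.3 * x + rng.normal(0, 0.5, size=5000))

fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()
print(np.exp(fit.params[1]))  # ~0.74, i.e. roughly a 26% decrease per unit of x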

Log-linear means an exponential: ln(y) = a x + b is equivalent to y = exp(a x) * exp(b), which is of the form A^x * B. Likewise, a log-log transform gives a power law: ln(y) = a ln(x) + b is of the form y = B * x^a, with B = exp(b).
On a log-linear plot an exponential will thus be a straight line, and a power law will be a straight line on a log-log plot.
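As a quick numeric check (a sketch of my own, not from the answer): the slope of ln(y) against x is constant for an exponential, and the slope of ln(y) against ln(x) is constant for a power law:
import numpy as np

x = np.linspace(1, 100, 200)
expo = 2.0 * np.exp(0.05 * x)   # y = B * exp(a*x)
power = 2.0 * x ** 1.7          # y = B * x^a

print(np.allclose(np.diff(np.log(expo)) / np.diff(x), 0.05))          # True
print(np.allclose(np.diff(np.log(power)) / np.diff(np.log(x)), 1.7))  # True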

Related

Calculating contrast values in Excel

I am currently studying experimental designs in statistics and I am calculating values pertaining to 2^3 factorial designs.
My question is specifically about the calculation of the "contrasts".
My goal is to learn how to use the "Coded Factors" and "Total" tables in order to get the "Contrast" values using the IF function in Excel.
For example, Contrast A is calculated as x - y, where
x = the sum of the values in Total where Coded Factor A is +,
and y = the sum of the values in Total where Coded Factor A is -.
This would be rather simple, but for the interactions it is a bit more complex.
For example, contrast AC is obtained as x - y, where
x = the sum of the values in Total where the product of Coded Factors A and C is +,
and y = the sum of the values in Total where the product of Coded Factors A and C is -.
I would really appreciate your help.
Edited:
Considering how IF statements work, I thought it might be a good idea to convert + into 1 and - into -1 to make the calculation straightforward.
Convert all +/- to 1/-1 and use some cells as helpers.
Put in these formulas :
J2 --> =LEFT(J1)
K2 --> =MID(J1,2,1)
L2 --> =MID(J1,3,1)
Put
J3 --> =IF(J$2="",1,INDEX($B3:$D3,MATCH(J$2,$B$2:$D$2,0)))
and drag to L10. Then
M3 --> =J3*K3*L3*G3
and drag to M10. Lastly,
M1 --> =SUM(M3:M10)
How to use: input the factor combination (e.g. AC) in cell J1 and the result will appear in M1.
Idea: split the factor text into letters > load the corresponding multipliers > multiply the Total values by the multipliers > sum.
Hope it helps.
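If you want to sanity-check the spreadsheet outside Excel, here is an equivalent sketch in Python (my addition; the totals are hypothetical stand-ins for your Total column):
import numpy as np
from itertools import product

# coded factors for a 2^3 design, one row per run, in -1/+1 coding
runs = np.array(list(product([-1, 1], repeat=3)))
A, B, C = runs[:, 0], runs[:, 1], runs[:, 2]
totals = np.array([12.0, 18.0, 15.0, 21.0, 14.0, 20.0, 17.0, 25.0])  # hypothetical

# a main-effect contrast: sum of totals where the factor is +, minus where it is -
contrast_A = totals[A == 1].sum() - totals[A == -1].sum()
# an interaction contrast uses the sign of the product of its factors
contrast_AC = totals[A * C == 1].sum() - totals[A * C == -1].sum()
print(contrast_A, contrast_AC)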

Circular Cell Reference

I have a question regarding a circular cell reference. I have come up with an example that illustrates my dilemma and I attached an illustration.
Here's the deal. My house needs heat and it needs electricity:
My house needs 7 units of heat and 1 unit of electricity.
My generator is 50% efficient. So for every unit of energy used to power the generator, I only get 0.5 units of electricity for my house. You can neither create nor destroy energy, so the other 50% that isn't turned into electricity turns into heat.
My heat pump consumes 1 unit of electricity in order to produce 2 units of heat. This means that the heat pump is 200% efficient. Additionally we get to use the waste heat from the generator.
Please look at the attached example. I drew out the scenario so you can visualize it. Subscript E is used to denote electricity. Subscript H is used to denote heat.
I need to be able to change the generator efficiency, the heat pump efficiency, and how much electricity the house needs. I would like to be able to manipulate each variable.
Can anyone help me input this into Excel?
[Attached image: example of the scenario]
Although I doubt the physical correctness of your approach, the math problem is solvable in Excel with goal seek.
Let's have the following sheet:
Formulas:
C4: =-(C2*B4)
D4: =-(C2+C4)
C6: =-(C4+C8)
D6: =-(C6*B6)
D8: =-(D4+D6)
D10: =D8/C8
Consumption is entered as negative values, production as positive values.
Now change the variables, for example B4, B6 or C8, and call What-If Analysis, Goal Seek: D10 must be 7 and C2 is the changing cell.
A real math solution would be:
Generator Input = gi
Generator Efficiency = geff
Heat Pump Efficiency = hpeff
House Electricity Needed = hen
House Heat Needed = hhn
(gi*geff - hen) * hpeff + gi*(100%-geff) = hhn
gi*geff*hpeff - hen*hpeff + gi*(100%-geff) = hhn
gi*geff*hpeff + gi*(100%-geff) = hhn + hen*hpeff
gi * (geff*hpeff + (100%-geff)) = hhn + hen*hpeff
gi = (hhn + hen*hpeff ) / (geff*hpeff + (100%-geff))
6kW = (7kW + 1kW*200%) / (50%*200% + (100% - 50%))
6kW = 9kW / 1.5
You can rearrange this a little to make it simpler.
You're trying to solve for a and b:
a = Heat Pump Input
b = Generator Input
With the following input values:
c = Generator Efficiency
d = Heat Pump Efficiency
H = Heat Needed
W = Electricity Needed
Now, we know:
W = bc
H = ad + b(1-c)
Hence we can derive:
b = W/c
a = (H - W(1-c)/c)/d
Input your values for c,d,H and W and you get your result.
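For what it's worth, the closed form is easy to script outside Excel too; a minimal sketch in Python (my own; the function and argument names are invented) using the gi formula derived above:
def generator_input(heat_needed, elec_needed, gen_eff, pump_eff):
    # gi = (hhn + hen*hpeff) / (geff*hpeff + (1 - geff))
    return (heat_needed + elec_needed * pump_eff) / (gen_eff * pump_eff + (1 - gen_eff))

# the worked example: 7 units of heat, 1 unit of electricity, 50% generator, 200% heat pump
print(generator_input(7, 1, 0.5, 2.0))  # 6.0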

Explanation of normalized edit distance formula

Based on this paper:
IEEE Transactions on Pattern Analysis: "Computation of Normalized Edit Distance and Applications". In this paper the normalized edit distance is defined as follows:
Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d(X, Y), is defined as the minimum of W(P) / L(P), where P is an editing path between X and Y, W(P) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (length of P).
Can I safely translate the normalized edit distance algorithm explained above as this:
normalized edit distance =
levenshtein(query 1, query 2)/max(length(query 1), length(query 2))
You are probably misunderstanding the metric. There are two issues:
The normalization step divides W(P), the total weight of the editing path, by L(P), the number of operations in that same path, not by the max length of the strings as you did;
Also, the paper shows (Example 3.1) that the normalized edit distance cannot simply be computed from the Levenshtein distance. You probably need to implement their algorithm.
An explanation of Example 3.1 (c):
From aaab to abbb, the paper used the following transformations:
match a with a;
skip a in the first string;
skip a in the first string;
skip b in the second string;
skip b in the second string;
match the final bs.
These are 6 operations, which is why L(P) is 6; from the matrix in (a), matching has cost 0 and skipping has cost 2, so the total cost is 0 + 2 + 2 + 2 + 2 + 0 = 8, which is exactly W(P), and W(P) / L(P) = 8/6 ≈ 1.33. Similar results can be obtained for (b), which I'll leave to you as an exercise :-)
The 3 in figure 2(a) refers to the cost of changing "a" to "b" or the cost of changing "b" to "a". The columns with lambdas in figure 2(a) mean that it costs 2 in order to insert or delete either an "a" or a "b".
In figure 2(b), W(P) = 6 because the algorithm does the following steps:
keep first a (cost 0)
convert first b to a (cost 3)
convert second b to a (cost 3)
keep last b (cost 0)
The sum of the costs of the steps is W(P). The number of steps is 4 which is L(P).
In figure 2(c), the steps are different:
keep first a (cost 0)
delete first b (cost 2)
delete second b (cost 2)
insert a (cost 2)
insert a (cost 2)
keep last b (cost 0)
In this path there are six steps, so L(P) is 6. The sum of the costs of the steps is 8, so W(P) is 8. Therefore the normalized edit distance is 8/6 = 4/3, which is about 1.33.
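If you want to experiment, here is a brute-force sketch in Python of the definition itself: a DP over (prefix of X, prefix of Y, number of operations), then a minimum of W(P)/L(P) over the path length. This is my own O(n*m*(n+m)) illustration using the weights from figure 2(a), not the faster algorithm from the paper:
def normalized_edit_distance(x, y, w_sub=3, w_indel=2):
    n, m = len(x), len(y)
    if n == 0 and m == 0:
        return 0.0
    INF = float("inf")
    max_k = n + m
    # d[k][i][j] = minimum weight of an editing path of exactly k operations
    # turning x[:i] into y[:j]
    d = [[[INF] * (m + 1) for _ in range(n + 1)] for _ in range(max_k + 1)]
    d[0][0][0] = 0
    for k in range(1, max_k + 1):
        for i in range(n + 1):
            for j in range(m + 1):
                best = INF
                if i > 0:                      # delete x[i-1]
                    best = min(best, d[k - 1][i - 1][j] + w_indel)
                if j > 0:                      # insert y[j-1]
                    best = min(best, d[k - 1][i][j - 1] + w_indel)
                if i > 0 and j > 0:            # match or substitute
                    c = 0 if x[i - 1] == y[j - 1] else w_sub
                    best = min(best, d[k - 1][i - 1][j - 1] + c)
                d[k][i][j] = best
    return min(d[k][n][m] / k for k in range(1, max_k + 1) if d[k][n][m] < INF)

print(normalized_edit_distance("aaab", "abbb"))  # 1.333...
Note that the minimum is attained by the six-operation path (8/6 ≈ 1.33), not the four-operation substitution path (6/4 = 1.5), which is exactly why a plain Levenshtein computation is not enough.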

How to do a linear regression in case of incomplete information about output variable

I need to do a linear regression:
y ~ x1 + x2 + x3 + x4
y is not known, but instead of y we have f(y), which depends on y.
For example, y is a probability between 0 and 1 from a binomial distribution over {0, 1}, and instead of y we observe (the number of 0s, the number of 1s) out of (the number of 0s + the number of 1s) experiments.
How should I perform the regression to find the correct y?
How should I take into account the amount of information available, given that for some x1, x2, x3 we have n experiments giving a high-confidence value of y, but for other x1, x2, x3 we have a low-confidence value of y due to a small number of measurements?
Sounds like you need something like BUGS (Bayesian inference Using Gibbs Sampling) for the unknown variable y.
It sounds like you might be asking for logistic regression.
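A concrete way to do that (a sketch of my own, assuming Python with numpy and statsmodels, and simulated data): fit a binomial GLM where each row carries its (successes, failures) counts, so settings with more experiments automatically get more weight:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                         # x1..x4, one row per setting
p = 1 / (1 + np.exp(-(0.5 + X @ np.array([1.0, -0.5, 0.3, 0.0]))))  # true probabilities
trials = rng.integers(1, 50, size=200)                # unequal numbers of experiments
ones = rng.binomial(trials, p)                        # observed count of 1s
zeros = trials - ones                                 # observed count of 0s

# endog as (successes, failures) lets well-measured settings carry more weight
endog = np.column_stack([ones, zeros])
fit = sm.GLM(endog, sm.add_constant(X), family=sm.families.Binomial()).fit()
print(fit.params)  # coefficients on the log-odds scale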

Is there a way to optimise this program in Haskell?

I am doing Project Euler problem 224 and whipped up this list comprehension in Haskell:
prob39 = length [ d | d <- [1..75000000], c <- [1..37500000], b <-[1..c], a <- [1..b], a+b+c == d, a^2 + b^2 == (c^2 -1)]
I compiled it with GHC and it has been running with above average kernel priority for over an hour without returning a result. What can I do to optimise this solution? It seems I am getting better at finding brute force solutions in a naive manner. Is there anything I can do about this?
EDIT: I am also unclear about the definition of 'integral length' - does this just mean the side length has a magnitude which falls in the set of positive integers, i.e. 1, 2, 3, 4, 5, ...?
My Haskell isn't amazing, but I think this is going to be n^4 as written.
It looks like you're saying: for each n from 1 to 75 million, check every "barely obtuse" triangle with a perimeter less than or equal to 75 million to see if it has perimeter n.
Also, I'm not certain whether list comprehensions are smart enough to stop looking once the current value of c^2 - 1 is greater than a^2 + b^2.
A simple refactor should be
prob39 = length [ (a, b, c) | c <- [1..37500000], b <- [1..c], a <- [1..b], a^2 + b^2 == c^2 - 1, a + b + c <= 75000000 ]
You can make it better, but that should literally be 75 million times faster.
Less certain about this refactoring, but it should also speed things up considerably:
prob39 = length [ (a, b, c) | a <- [1..25000000], b <- [a..((75000000 - a) `div` 2)], c <- [b..(75000000 - a - b)], a^2 + b^2 == c^2 - 1 ]
Syntax may not be 100% there. The idea is that a can only run from 1 to 25 million (since a <= b <= c and a + b + c <= 75 million), b can only run from a up to (75000000 - a) / 2 (since b <= c), and c can only run from b to 75000000 - a - b, otherwise the perimeter would be over 75 million.
Edit: updated code snippets, there were a couple of bugs in there.
Another quick suggestion: you can replace c <- [b..(75000000 - a - b)] with something along the lines of c <- [b..min (75000000 - a - b) (ceiling (sqrt (fromIntegral (a*a + b*b))) + 1)]. There's no need to bother checking any values of c greater than the ceiling of the square root of a^2 + b^2 (since c^2 = a^2 + b^2 + 1). Can't remember if those are the correct min/sqrt function names in Haskell though.
Getting OCD on this one, I have a couple more suggestions.
1) You can set the upper bound on b to the min of the current upper bound and a^2 / 2. This is based on the principle that (x+1)^2 - x^2 = 2x + 1: since c >= b + 1, we need a^2 + b^2 = c^2 - 1 >= (b+1)^2 - 1, i.e. a^2 >= 2b. b cannot be so much larger than a that a^2 + b^2 is guaranteed to fall short of (b+1)^2 - 1.
2) Set the lower bound of c to the max of b + 1 and floor(sqrt(a^2 + b^2)) + 1. Just like the upper limit on c, there's no need to test values which couldn't possibly be correct.
Along with the suggestions given by @patros, I would like to share my observations on this problem.
If we print the values of a, b and c for some perimeter, say 100000, we can observe that a and b always take even values and c always takes odd values (working mod 4: a^2 + b^2 can never be 3 mod 4, so c must be odd, and then c^2 - 1 is divisible by 4, which forces both a and b to be even). So if we optimise our code with these restrictions, almost half the checking can be skipped.
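That parity observation is easy to sanity-check with a small brute-force sketch (in Python for brevity; the helper is hypothetical and not part of any solution above):
import math

# enumerate triangles with a^2 + b^2 == c^2 - 1 and a <= b <= c up to a perimeter cap
def triples(max_perimeter):
    out = []
    for a in range(1, max_perimeter // 3 + 1):
        for b in range(a, (max_perimeter - a) // 2 + 1):
            c2 = a * a + b * b + 1
            c = math.isqrt(c2)
            if c * c == c2 and a + b + c <= max_perimeter:
                out.append((a, b, c))
    return out

# every solution has a and b even and c odd
print(all(a % 2 == 0 and b % 2 == 0 and c % 2 == 1 for (a, b, c) in triples(1000)))  # True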
