Creating independent variable Stata in Panel Data model - excel

Data and description of variables
Picture 1 and Sample unbalanced paneldata
Picture 1 shows a balanced panel that I created from the unbalanced panel shown as a sample in the same image, in which multiple products (ID) are observed for different numbers of years (YEAR). Each product (ID) is offered by a different number of shops. As stated, the balanced set was created by keeping only the same years, the same products (ID), and the same shops (marked by the orange area in the sample unbalanced panel data). This is an important assumption that might affect how the issue stated below is perceived. The following describes the table shown in Picture 1:
Years indicates the periods over which a given product (ID) is on offer
Shop 1, Shop 2, Shop 3 indicate the prices charged for a given product (ID) by different firms
The minimum and second minimum values give the lowest and second-lowest shop prices for a given year and product (ID). They are needed to calculate the Price difference, which is (Second minimum value - Minimum value) / (Minimum value)
For example, in row 5 (Year 01.01.1995 - ID 101) the Price difference is (3999 - 3790) / 3790 = 5.51% (in Picture 1)
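As a sanity check on the formula, here is a minimal Python sketch of the Price difference calculation for one row. The function name `price_difference` and the third shop price (4200) are made up for illustration; only 3999 and 3790 come from the example above. Missing prices are represented as `None`, which is an assumption about how an unbalanced row might look.

```python
def price_difference(prices):
    """Return (second minimum - minimum) / minimum for the quoted prices.

    Returns None when fewer than two shops quote a price, because the
    second-lowest price is then undefined.
    """
    quoted = sorted(p for p in prices if p is not None)
    if len(quoted) < 2:
        return None
    lowest, second_lowest = quoted[0], quoted[1]
    return (second_lowest - lowest) / lowest


# 3999 and 3790 are taken from the example; 4200 is a hypothetical third price.
row = [3999, 3790, 4200]
diff = price_difference(row)
print(f"{diff:.2%}")  # 5.51%
```

Note that with a single shop the measure is undefined, which matters once rows with different numbers of shops are compared.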
Issue
In my balanced panel data (Picture 1), I want to run a fixed-effects regression in Stata using the xtreg command, where the dependent variable is the Price difference and the independent variables capture the number of shops selling a product. The aim is to say how the Price difference is affected when one shop, two shops, or three shops sell the product.
Another question is whether my assumption of creating a balanced panel is valid at all. Is it correct to create a balanced panel from the unbalanced panel data, or must I use the unbalanced panel to create such a variable?
So my main issue is how to create independent variables that measure the number of shops offering each product. To clarify what I mean, I have included an example of a sample fixed-effects regression that may explain the structure I am after, in Picture 2 below:
NOTE (In Picture 2, the expected cell mean on the right is the same as the Price difference in Picture 1 and is used as the dependent variable. It is regressed on the number of firms/shops as independent variables, and these are the variables I have trouble creating)
Picture 2
What I have tried
I have tried using dummy variables for the shops, but they ended up getting dropped. The dataset provided in Picture 1 is a balanced data set, as mentioned, which (I assume) is needed to run a fixed-effects regression on panel data.
End remark
I asked this question earlier in a much less precise manner, and I apologise for any inconvenience. The problem might be that I have set the data up wrongly in Excel, which is why the dummies are dropped, or something of that nature. It might also be that I have to use the unbalanced set to create this independent variable, so the problem may be that I am attempting to use a balanced set instead of the unbalanced one.

In your unbalanced sample (as we discussed in the comments, the balanced sample will not make sense), we first need to create a variable for the number of shops offering each ID. Suppose we have the same data as in the top portion of your Picture 1:
egen number_of_firms = rownonmiss(Shop*)
xtset ID year // to use xtreg, we must tell Stata the data are panel
xtreg Price_difference i.number_of_firms, fe // fe requests fixed effects; xtreg alone fits random effects
The xtreg command runs the regression shown in your Picture 2.
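To make the first step concrete outside Stata, here is a small Python sketch of what a row-wise count of non-missing shop prices does, which is the job rownonmiss(Shop*) performs above. The rows, the column names `Shop1` to `Shop3`, and all prices other than 3999 and 3790 are hypothetical; missing prices are written as `None`.

```python
def count_shops(row, shop_columns):
    """Count how many shop columns hold a price in this row (None = missing)."""
    return sum(1 for col in shop_columns if row.get(col) is not None)


SHOP_COLUMNS = ["Shop1", "Shop2", "Shop3"]

# Hypothetical unbalanced rows: not every shop offers every product every year.
rows = [
    {"ID": 101, "YEAR": 1995, "Shop1": 3999, "Shop2": 3790, "Shop3": None},
    {"ID": 101, "YEAR": 1996, "Shop1": 3899, "Shop2": 3750, "Shop3": 3800},
    {"ID": 102, "YEAR": 1995, "Shop1": None, "Shop2": 1200, "Shop3": None},
]

for row in rows:
    row["number_of_firms"] = count_shops(row, SHOP_COLUMNS)

counts = [row["number_of_firms"] for row in rows]
print(counts)  # [2, 3, 1]
```

This only works on the unbalanced data: in a balanced subset every row has the same count, so the variable would have no variation left to estimate from.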
If you want the number of firms variable to be formatted a bit more like Picture 2, you can do something like this:
qui levelsof number_of_firms, local(num)
foreach n in `num' {
    local lab_def `lab_def' `n' "`n' Firms"
}
label def num_firms `lab_def'
label values number_of_firms num_firms
label var number_of_firms "Number of Firms"
Then run the regression, and the output will be formatted with the number-of-firms labels.

Related

Which is the best Data Mining model to extrapolate known values to missing values in a table? (General question)

I am working on a little data mining project (I am still a Data Science student, not a professional). Maybe you can help me to choose a proper model for my task.
So, let's say we have a table with three columns and around 4000 rows:
YEAR  COLOR   NAME
1900  Green   David
1901  Yellow  Sarah
1902  Green   ???
1902  Red     Sarah
…     …       …
2020  Purple  John
Any value for any field can be repeated in the dataset (also Year values).
The first two columns have no missing values, but only around 20% of the Name values in the third column are known. The Name value depends somewhat on the first two columns (not a causal relation).
My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each name value (for example in a boxplot)
I have imagined a process like the following, although I am not sure whether it makes sense statistically (any objections and suggestions are appreciated):
For every unknown NAME value, the algorithm randomly chooses one of the already known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' values tend to be correlated with low Year values AND with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen if the input values for Year and Color are "1900, Purple".
When the above process ends, the number of occurrences of each name is counted.
The above process is applied 30 times and the results for each name are displayed in a boxplot.
However, I don't know which is the best model to implement an idea similar to this. I have drawn the process in a simple paint drawing:
Possible output for the task
What do you think would be a good approach to this task? I appreciate any help.
I think you have the process down; converting the data may be the first hurdle.
I would look at OrdinalEncoder (from sklearn.preprocessing import OrdinalEncoder) to convert the categorical data to numeric.
You could then use a random number generator to produce a number within the range defined by the encoding, which would randomly select a name.
Loop through this 30 times with a for loop to achieve the result.
It also looks like you will need to provide the ranking values for year and colour before building out your code. From there you would just provide bands (for example, if year > 1985, and so on) within your for loop to specify the names.
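The process described in the question can be sketched in plain Python as below. This is only one possible reading of it, not the encoder-based approach suggested above: instead of OrdinalEncoder and hand-written bands, each known row is weighted by a made-up similarity rule (closer year means higher weight, a different colour is penalised by the hypothetical factor `color_penalty`). The function names and the five sample rows are assumptions for illustration.

```python
import random
from collections import Counter


def name_weights(year, color, known_rows, color_penalty=0.2):
    """Weight each known row by how similar its year and colour are to the target."""
    weights = []
    for known_year, known_color, _ in known_rows:
        year_weight = 1.0 / (1.0 + abs(year - known_year))
        color_weight = 1.0 if known_color == color else color_penalty
        weights.append(year_weight * color_weight)
    return weights


def impute_once(rows, rng):
    """Fill every missing name by weighted random choice and count all names."""
    known = [r for r in rows if r[2] is not None]
    names = [r[2] for r in known]
    counts = Counter(names)
    for year, color, name in rows:
        if name is None:
            weights = name_weights(year, color, known)
            counts[rng.choices(names, weights=weights, k=1)[0]] += 1
    return counts


def simulate(rows, runs=30, seed=0):
    """Repeat the imputation `runs` times; returns one Counter per run."""
    rng = random.Random(seed)
    return [impute_once(rows, rng) for _ in range(runs)]


# Rows from the question's table; None marks the unknown name.
rows = [
    (1900, "Green", "David"),
    (1901, "Yellow", "Sarah"),
    (1902, "Green", None),
    (1902, "Red", "Sarah"),
    (2020, "Purple", "John"),
]

results = simulate(rows)
per_name = {n: [c[n] for c in results] for n in ("David", "Sarah", "John")}
print(per_name["David"])  # 30 counts, one per run, ready for a boxplot
```

The lists in `per_name` are what a boxplot would be drawn from. Whether the spread across runs is statistically meaningful depends on how well the weighting rule reflects the real dependence of Name on Year and Color, which this sketch does not establish.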

What test can I use to calculate significance of optimization lift?

Given the following data for 12 users:
username, number of deals for control, revenue from test, revenue from control
Here's an example of what the data looks like
Can you help me figure out how I can calculate the significance of the hypothesis that the test is more profitable (preferably using excel)?
The measure I was thinking of using was the % of lift in revenues for each customer.
P.S. I have a background in statistics but am not an expert, so please keep it as simple as possible.
Since each pair of incomes refers to the same individual, you can perform a paired t-test.
Variable 1: Control income
Variable 2: Deals income
Then follow these instructions (copied here for posterity):
1. In Excel, click Data Analysis on the Data tab.
2. From the Data Analysis popup, choose t-Test: Paired Two Sample for Means.
3. Under Input, select the ranges for both Variable 1 and Variable 2.
4. In Hypothesized Mean Difference, you’ll typically enter zero. This value is the null hypothesis value, which represents no effect. In this case, a mean difference of zero represents no difference between the two methods, which is no effect.
5. Check the Labels checkbox if you have meaningful variables labels in row 1. This option helps make the output easier to interpret. Ensure that you include the label row in step #3.
6. Excel uses a default Alpha value of 0.05, which is usually a good value. Alpha is the significance level. Change this value only when you have a specific reason for doing so.
7. Click OK.
Alternatively, you can calculate the difference between the two incomes and then perform a one-sample t-test of the null hypothesis that the mean difference is zero. However, such a test is not available out of the box in Excel; the procedure is described here.
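For readers who want to check the Excel output by hand, the paired t-test and the one-sample test on the differences give the same statistic. Below is a minimal Python sketch of that calculation. The revenue figures are made up (the real data for the 12 users is only available as an image), and the function name `paired_t` is hypothetical. The critical value 1.796 is the standard one-tailed 5% value of the t distribution with 11 degrees of freedom.

```python
import math
import statistics


def paired_t(test, control):
    """Return (t statistic, degrees of freedom) for paired samples.

    Assumes at least two pairs and that the differences are not all identical.
    """
    if len(test) != len(control):
        raise ValueError("both samples must have one value per user")
    diffs = [t - c for t, c in zip(test, control)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Made-up revenues for 12 users, for illustration only.
test_revenue = [120, 95, 143, 110, 87, 160, 132, 101, 99, 150, 125, 118]
control_revenue = [100, 90, 130, 115, 80, 140, 120, 105, 95, 135, 110, 112]

t_stat, df = paired_t(test_revenue, control_revenue)

# One-tailed test of "test is more profitable" at the 5% level, df = 11.
T_CRITICAL_ONE_TAILED = 1.796
print(f"t = {t_stat:.3f}, df = {df}, significant: {t_stat > T_CRITICAL_ONE_TAILED}")
```

The same approach works on the percentage lift per customer: replace the raw differences with the per-customer lift and test whether its mean is above zero.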

How do I distribute a value over multiple cells evenly but under a maximum limit?

Example data with desired outcome that I need to calculate
I have 12 items of a certain current value. I have a 'soft' cap of $1,000,000 for these values. Some of the items fall above, and some below this cap level.
I have an amount of money (for this example $900,000) that I want to distribute amongst only the items that fall below the cap (in this example 6 items), with the aim of bringing the value of these items up to but not over the cap value.
If I distribute the $900,000 evenly over these 6 items (each receiving $150,000), you can see that items 2 and 9 would then be over the $1,000,000 cap. So items 2 and 9 should only receive $100,000 each to raise their value to the cap, and the remaining 4 items would then receive an equal share of the remaining pool of money ($700,000 / 4 = $175,000).
So I need a formula to check every item to see if it needs a distribution (i.e below the cap) and then portion/divide out the money pool as illustrated above in the desired distribution column.
Note: The pool of money to be distributed can change. Also the number of items below the cap can change. The cap value itself can change.
I am hoping to avoid VBA or Solver because the spreadsheet could be used on other people's computers.
Hopefully this makes sense. Thanks.
EDIT:
So far I have been able to get close by adding a helper column and using the following formula:
=IF(SUM($F$6:F14)=$D$23,0,E15*MIN(D15,($D$23-SUM($F$6:F14))/SUM(E15:$E$18)))
Working example when values are sorted.
This seems to work when the values are sorted in descending order, as shown in the example image above. However, it seems to break when the values are in a more random order, which is likely to happen (as in the original post).
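Independent of any spreadsheet formula, the intended allocation can be stated as a short order-independent procedure: split the pool evenly among the items still below the cap, top up any item whose remaining room is no larger than that even share, and repeat with what is left. The Python sketch below illustrates that logic only; it is not a replacement for a worksheet formula. The function name `distribute` and the twelve item values are hypothetical, chosen so that items 2 and 9 sit at $900,000 as the description implies.

```python
def distribute(values, pool, cap):
    """Split `pool` evenly among items below `cap` without pushing any past it.

    Returns (list of shares in the original item order, undistributed remainder).
    """
    room = {i: cap - v for i, v in enumerate(values) if v < cap}
    shares = [0.0] * len(values)
    remaining = float(pool)
    while room and remaining > 0:
        even_share = remaining / len(room)
        capped = [i for i, r in room.items() if r <= even_share]
        if not capped:
            # Everyone can absorb the even share, so hand it out and stop.
            for i in room:
                shares[i] = even_share
            remaining = 0.0
            break
        for i in capped:
            # Top this item up to the cap and remove it from the next round.
            top_up = room.pop(i)
            shares[i] = top_up
            remaining -= top_up
    return shares, remaining


# Hypothetical current values for 12 items (unsorted on purpose).
values = [1_200_000, 900_000, 1_500_000, 600_000, 1_100_000, 700_000,
          1_300_000, 500_000, 900_000, 1_050_000, 800_000, 1_400_000]

shares, leftover = distribute(values, pool=900_000, cap=1_000_000)
print(shares[1], shares[8])  # 100000.0 100000.0 (items 2 and 9)
print(shares[3], shares[5])  # 175000.0 175000.0
```

If the pool exceeds the total room below the cap, every eligible item is topped up and the excess is returned as the remainder rather than being forced onto any item.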
Just to give you an idea of how Solver can be set up for a capital budget model, here is one; it also shows the Solver dialog and its settings:

Passing parameter to the drill thru report in cognos

How to Pass a Calculated Member/measure to a Drill-thru Target Report
In order to avoid using Calculated Members--because from googling some people were saying you could not pass them via Drill-thru--I went back to my FM model and created 3 new measures (High Risk, Low Risk and Medium Risk). Now these show up in the drill-thru definition's parameter list. My only problem is: how can I check which of the three measures has been selected by a user?
Remember, I basically have a line chart with 3 lines, one for each measure above (High, Medium or Low Risk) by time frame. A user will select a data point, High risk for March or Medium Risk for Semester 2, for example. I then need to pass the value for that datapoint to my Target (2nd) report. How can I check for which of the three measure values they passed through?!?

Bar Chart of student performance - Tableau

I have this dashboard. In sheet two, the bar chart just counts the number of marks in each range specified on the x-axis. What I actually want is a bar chart over the same ranges that counts students by their average mark. In sheet 3, the bar chart looks similar to what I expect, but if you take a look, it just stacks each student's average one on top of another.
So, how can I make a bar chart showing the frequency of students' average marks? The ranges should be: [0, 5>, [5, 10>, [10, 15>, [15, 20].
One solution is to create a custom SQL data connection that first calculates the average NOTA for each student, as below:
select NOMBRES, avg(NOTA) as avg_nota from YOUR_TABLE group by NOMBRES
Then you can create a histogram for avg_nota, either with Show Me or manually.
Here is a link to an example based on your original
The SQL above weighs each score equally, which is fine if each course has exactly the same number of grades. But if the number of records varies between courses, you should adjust the approach to make sure each course is weighted the same (e.g. so that a course with 10 small tests does not get weighted twice as much as a course with 5 larger tests). The solution in that case might involve repeating the above step in a nested subquery or view that groups by both NOMBRES and CURSO. Still, this simple approach should give you the basic idea.
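The difference between the two averaging schemes, and the binning into the requested ranges, can be illustrated with a small Python sketch. It is not Tableau or SQL, only a way to check the logic on a few rows. The records, student names, and function names are made up; the tuple fields stand in for NOMBRES, CURSO, and NOTA, and the bins are written as [0,5), [5,10), [10,15), [15,20], which is the same half-open convention as in the question.

```python
from collections import defaultdict
from statistics import mean

BINS = ("[0,5)", "[5,10)", "[10,15)", "[15,20]")


def student_averages(records, by_course=False):
    """records: (student, course, mark) tuples -> {student: average mark}.

    With by_course=True, marks are averaged per course first, so every
    course carries the same weight regardless of how many marks it has.
    """
    if not by_course:
        marks = defaultdict(list)
        for student, _, mark in records:
            marks[student].append(mark)
        return {s: mean(m) for s, m in marks.items()}
    per_course = defaultdict(list)
    for student, course, mark in records:
        per_course[(student, course)].append(mark)
    course_means = defaultdict(list)
    for (student, _), course_marks in per_course.items():
        course_means[student].append(mean(course_marks))
    return {s: mean(m) for s, m in course_means.items()}


def bin_label(avg):
    """Map an average in [0, 20] to its range label."""
    if avg < 5:
        return BINS[0]
    if avg < 10:
        return BINS[1]
    if avg < 15:
        return BINS[2]
    return BINS[3]


def histogram(averages):
    """Count how many students fall in each range."""
    counts = {label: 0 for label in BINS}
    for avg in averages.values():
        counts[bin_label(avg)] += 1
    return counts


# Made-up marks: (student, course, mark)
records = [
    ("Ana", "Math", 18), ("Ana", "Math", 16), ("Ana", "Art", 8),
    ("Luis", "Math", 4), ("Luis", "Art", 5),
    ("Eva", "Math", 12), ("Eva", "Art", 19), ("Eva", "Art", 17),
]

flat = student_averages(records)
weighted = student_averages(records, by_course=True)
print(flat["Ana"], weighted["Ana"])  # 14 12.5
print(histogram(flat))
```

Ana's average drops from 14 to 12.5 once each course counts equally, which shows why the choice of weighting can move a student into a different bar.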
The solution above works, but I think there ought to be a way to get the same effect using table calculations, without resorting to custom SQL.
