Calculating regional contribution to national GDP growth - statistics

Is there a simple way in R or Stata to calculate the regional contribution to national GDP growth?
For instance if I have the following, how do I calculate the contribution of the regions' growth to the overall national growth?
Region/country   % change   weight
Region 1         0.3        0.25
Region 2         0.1        0.25
Region 3         0.25       0.25
Region 4         0.15       0.25
Country          0.2        1

To get the contribution of each region, you just need to multiply its % change by its weight; the contributions then sum to the national growth rate.
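The question asks about R or Stata, but the arithmetic is the same in any tool. Here is a minimal sketch in Python/pandas using the figures from the table above (the column names are made up):

import pandas as pd

df = pd.DataFrame({
    "region": ["Region 1", "Region 2", "Region 3", "Region 4"],
    "pct_change": [0.30, 0.10, 0.25, 0.15],
    "weight": [0.25, 0.25, 0.25, 0.25],
})

# Contribution of each region = its growth rate times its weight in the national total.
df["contribution"] = df["pct_change"] * df["weight"]
print(df)

# The contributions sum to the national growth rate (0.2 in this example).
print(df["contribution"].sum())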

Related

In Python: How to convert 1/8th of space to 1/6th of space?

I have a dataframe at store-product level, as shown in the sample below:
Store Product Space Min Max Total_table Carton_Size
11 Apple 0.25 0.0625 0.75 2 6
11 Orange 0.5 0.125 0.5 2 null
11 Tomato 0.75 0.0625 0.75 2 6
11 Potato 0.375 0.0625 0.75 2 6
11 Melon 0.125 0.0625 0.5 2 null
Scenario: All products here have space in terms of 1/8th. But if a product has a carton_size other than null, then that product's space has to be converted in terms of 1/(carton_size)th, respecting Min (Space shouldn't be less than Min) and Max (Space shouldn't be greater than Max). Space can be taken from the non-carton products, but at the end the sum of the 'Space' column should be equal to or less than the 'Total_table' value. Also, these 1/8th and 1/6th values are in relation to the 'Total_table'; the total_table value is split up as Space for each product.
Example: In the dataframe above, three products have a carton size, so we can take 1/8th of space from the non-carton product, selecting from the top, and split it as 1/24 (meaning 1/24 + 1/24 + 1/24 = 1/8), which can be added to the three carton products to make each of them 1/6 (see the arithmetic sketch after this question). This forms the expected output shown below, considering the Min and Max values. If any product doesn't satisfy the Min or Max condition, leave that product unchanged (e.g., Tomato).
Roughly Expected Output:
Store Product Space Min Max Total_table Carton_Size
11 Apple 0.292 0.0625 0.75 2 6
11 Orange 0.375 0.125 0.5 2 null
11 Tomato 0.75 0.0625 0.75 2 6
11 Potato 0.417 0.0625 0.75 2 6
11 Melon 0.125 0.0625 0.5 2 null
Need solution in Python.
Thanks in Advance!
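A minimal sketch of the split arithmetic described in the example above (this is not a full solution; the Min/Max checks and the choice of donor product are not handled, and the figures come straight from the question):

from fractions import Fraction

eighth = Fraction(1, 8)              # every product starts with 1/8th of space
carton_products = 3                  # Apple, Tomato, Potato have a carton size of 6
share_per_product = eighth / carton_products

print(share_per_product)             # 1/24, taken from one non-carton product
print(eighth + share_per_product)    # 1/6, the target fraction for a carton product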

Why do I get different precision, recall and F1 scores for different methods of calculating the macro average

I calculated the macro-average of the P, R and F1 of my classification using two methods. Method 1 is
print("Macro-Average Precision:", metrics.precision_score(predictions, test_y, average='macro'))
print("Macro-Average Recall:", metrics.recall_score(predictions, test_y, average='macro'))
print("Macro-Average F1:", metrics.f1_score(predictions, test_y, average='macro'))
which gave this result:
Macro-Average Precision: 0.6822
Macro-Average Recall: 0.7750
Macro-Average F1: 0.7094
Method 2 is:
print(classification_report(y_true, y_pred))
which gave this result:
              precision    recall  f1-score   support

           0       0.55      0.25      0.34       356
           1       0.92      0.96      0.94      4793
           2       0.85      0.83      0.84      1047

    accuracy                           0.90      6196
   macro avg       0.78      0.68      0.71      6196
weighted avg       0.89      0.90      0.89      6196
I expected the output of both methods to be the same, since they were generated at the same time in the same run.
Can someone explain why this happened, or whether there is a mistake somewhere?
As far as I can tell from the classification_report results, you have multiple classes.
If you check the documentation for the individual functions in the metrics module, the default parameters treat class '1' as the positive class.
I think what might be happening is that your first calculation is a one-versus-all computation (0 and 2 are the negative classes and 1 is the positive class), while in the second case you are taking a true multi-class situation into account.
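One thing worth checking (assuming scikit-learn): the per-metric functions take the ground truth as the first argument and the predictions as the second, the same order classification_report uses. A quick sanity check with toy labels, comparing the two summaries on identical inputs:

from sklearn import metrics

# Toy labels, only to show that the two summaries agree when both are
# given the same (y_true, y_pred) pair in the same argument order.
y_true = [0, 1, 2, 1, 1, 2, 0, 1]
y_pred = [0, 1, 1, 1, 2, 2, 1, 1]

print(metrics.classification_report(y_true, y_pred))
print("Macro-Average Precision:", metrics.precision_score(y_true, y_pred, average='macro'))
print("Macro-Average Recall:", metrics.recall_score(y_true, y_pred, average='macro'))
print("Macro-Average F1:", metrics.f1_score(y_true, y_pred, average='macro'))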

Excel and selecting variables conditionally

I have a data set which contains information by country. For example, Australia_F is the observation for Australia and Australia_Weight is the weight of Australia. Each period represents a specific year.
Period Australia_F Canada_F Denmark_F Japan_F Australia_Weight Canada_Weight Denmark_Weight Japan_weight
1985 0.05 -0.02 0.02 0.03 0.10 0.30 0.45 0.15
1986 -0.04 -0.03 0.02 0.01 0.15 0.30 0.30 0.25
The user can input any value into the following cell. For example, I have entered 3:
Weight_Modification = 3
The goal is to only include countries where the variable XXXXX_F is positive,
and to use those with the highest values such that the total weight of the countries selected is not greater than 1.
The problem is complicated by the fact that the Weight_Modification variable multiplies each individual country weight by whatever its value is. For example, the weight for Australia would be 0.10 * 3 = 0.3 in 1985.
Total weights can be less than 1.00 but can't be greater than 1.00.
So taking the above data as an example and for 1985 the results would be
Australia_weight   Canada_weight   Denmark_weight   Japan_weight   Total_weight
0.3                                                 0.45           0.75
This is because in 1985 Australia has the highest value (Australia_F = 0.05), followed by Japan (Japan_F = 0.03).
Each country's weight is multiplied by 3.
Denmark is not selected, even though Denmark_F is positive, because including Denmark would push the total weight above 1.
In the actual file there are many more countries (12 in total) and many years.
Any help with how to put this together in Excel is greatly appreciated.
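The asker wants this in Excel, but to pin down the selection rule, here is a sketch in Python using the 1985 figures above (the names and the greedy order are taken from the example; this is not the Excel formula itself):

weight_modification = 3

# 1985 values from the sample table
f_values = {"Australia": 0.05, "Canada": -0.02, "Denmark": 0.02, "Japan": 0.03}
weights  = {"Australia": 0.10, "Canada": 0.30, "Denmark": 0.45, "Japan": 0.15}

selected, total = {}, 0.0
# Only countries with a positive _F value, taken in descending order of _F
for country in sorted((c for c in f_values if f_values[c] > 0),
                      key=lambda c: f_values[c], reverse=True):
    w = weights[country] * weight_modification
    if total + w <= 1.0:          # skip a country if it would push the total above 1
        selected[country] = w
        total += w

print(selected)           # Australia (0.3) and Japan (0.45); Denmark (0.45 * 3 = 1.35) would exceed 1
print(round(total, 2))    # 0.75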

Excel formula in one column based on variable entries in another column

My goal is to populate column G for each row that has a CPT code (column B). If I were to copy and paste by hand this would simply be: F2/F5, F3/F5, F4/F5, F6/F15, F7/F15, etc. Column A contains the names of staff. Each month, staff code for different types of procedures (columns B and C) and a variable quantity of each (column E). The rows with 'Total' in column B represent the end of a staff person's monthly list of codes. We are hoping to provide monthly updates to our service chiefs. I am hoping someone might be able to help me devise a formula I could copy down column G. We are looking at roughly 3,000-4,000 rows of codes per month for about 200 different clinical providers.
CPTCode CPTName Work RVU FY16 Total Qty FY16 Total RVU CPT's % of FY RVUs
96119 NEUROPSYCH TESTING BY TECH 0.6 76 41.8
99212 OFFICE/OUTPATIENT VISIT EST 0.5 2 1.0
T1016 CASE MANAGEMENT 0.5 1 0.5
Total 79 43.3
H0038 SELF-HELP/PEER SVC PER 15MIN 0.0 727 0.0
90853 GROUP PSYCHOTHERAPY 0.6 236 139.2
99212 OFFICE/OUTPATIENT VISIT EST 0.5 153 73.4
S9446 PT EDUCATION NOC GROUP 0.4 105 42.0
99211 OFFICE/OUTPATIENT VISIT EST 0.2 44 7.9
90785 PSYTX COMPLEX INTERACTIVE 0.3 10 3.3
99202 OFFICE/OUTPATIENT VISIT NEW 0.9 1 0.9
99213 OFFICE/OUTPATIENT VISIT EST 1.0 1 1.0
H0031 MH HEALTH ASSESS BY NON-MD 0.6 1 0.6
Total 1278 268.4
H0038 SELF-HELP/PEER SVC PER 15MIN 0.0 452 0.0
98967 HC PRO PHONE CALL 11-20 MIN 0.5 1 0.5
Total 453 0.5
(Screenshot: http://i.stack.imgur.com/dw44F.jpg)
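The asker is after an Excel formula, but to make the grouping logic concrete, here is a rough pandas sketch (the column names are invented; the idea is that every block of rows ends at its 'Total' row, and each row's RVU is divided by that block's total):

import pandas as pd

# Toy frame mirroring the first staff block of the sample data
df = pd.DataFrame({
    "CPTCode": ["96119", "99212", "T1016", "Total"],
    "FY16_Total_RVU": [41.8, 1.0, 0.5, 43.3],
})

# Each 'Total' row closes a block; label every row with the id of its block.
block_id = (df["CPTCode"] == "Total").shift(fill_value=False).cumsum()

# Divide each row's RVU by the 'Total' RVU of its own block
# (the 'Total' rows themselves simply come out as 1.0 here).
block_total = df.groupby(block_id)["FY16_Total_RVU"].transform(
    lambda s: s[df.loc[s.index, "CPTCode"] == "Total"].iloc[0]
)
df["pct_of_FY_RVUs"] = df["FY16_Total_RVU"] / block_total
print(df)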

Effective reasonable indexing for numeric vector search?

I have a long numeric table where 7 columns are a key and 4 columns are the value to find.
Actually, I have rendered an object at different distances and perspective angles and calculated Hu moments for its contour. But this is not important to the question, just a sample to help imagine it.
So, when I have 7 values, I need to scan the table, find the closest values in those 7 columns and extract the corresponding 4 values.
The aspects of the task to consider are as follows:
1) the numbers have errors
2) the scale in the function domain is not the same as the scale of the function value; i.e. the "distance" from a point in 7-dimensional space should depend on how strongly it affects those 4 values
3) the search should be fast
So the question is: is there an algorithm out there to solve this task efficiently, i.e. perform some indexing on those 7 columns, but not the way conventional databases do it, rather taking the points above into account?
If I understand the problem correctly, you might consider using scipy.cluster.vq (vector quantization):
Suppose your 7 numeric columns look like this (let's call the array code_book):
import scipy.cluster.vq as vq
import scipy.spatial as spatial
import numpy as np
np.random.seed(2013)
np.set_printoptions(precision=2)
code_book = np.random.random((3,7))
print(code_book)
# [[ 0.68 0.96 0.27 0.6 0.63 0.24 0.7 ]
# [ 0.84 0.6 0.59 0.87 0.7 0.08 0.33]
# [ 0.08 0.17 0.67 0.43 0.52 0.79 0.11]]
Suppose the associated 4 columns of values looks like this:
values = np.arange(12).reshape(3,4)
print(values)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
And finally, suppose we have some "observations" of 7-column values like this:
observations = np.random.random((5,7))
print(observations)
# [[ 0.49 0.39 0.41 0.49 0.9 0.89 0.1 ]
# [ 0.27 0.96 0.16 0.17 0.72 0.43 0.64]
# [ 0.93 0.54 0.99 0.62 0.63 0.81 0.36]
# [ 0.17 0.45 0.84 0.02 0.95 0.51 0.26]
# [ 0.51 0.8 0.2 0.9 0.41 0.34 0.36]]
To find the 7-valued row in code_book which is closest to each observation, you could use vq.vq:
index, dist = vq.vq(observations, code_book)
print(index)
# [2 0 1 2 0]
The index values refer to rows in code_book. However, if the rows in values are ordered the same way as code_book, we can "look up" the associated values with values[index]:
print(values[index])
# [[ 8 9 10 11]
# [ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [ 0 1 2 3]]
The above assumes you have all your observations arranged in an array. Thus, to find all the indices you need only one call to vq.vq.
However, if you obtain the observations one at a time and need to find the closest row in code_book before going on to the next observation, then it would be inefficient to call vq.vq each time. Instead, generate a KDTree once, and then find the nearest neighbor(s) in the tree:
tree = spatial.KDTree(code_book)
for observation in observations:
    distances, indices = tree.query(observation)
    print(indices)
# 2
# 0
# 1
# 2
# 0
Note that the number of points in your code_book (N) must be large compared to the dimension of the data (e.g. N >> 2**7) for the KDTree to be fast compared to simple exhaustive search.
Using vq.vq or KDTree.query may or may not be faster than exhaustive search, depending on the size of your data (code_book and observations). To find out which is faster, be sure to benchmark these versus an exhaustive search using timeit.
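For completeness, a rough benchmark harness along those lines (the array sizes here are made up; swap in your real code_book and observations):

import timeit
import numpy as np
import scipy.cluster.vq as vq
import scipy.spatial as spatial

# Hypothetical sizes; replace with your real data.
rng = np.random.default_rng(0)
code_book = rng.random((5000, 7))
observations = rng.random((500, 7))
tree = spatial.KDTree(code_book)

def brute_force():
    # Exhaustive search: distance from every observation to every code_book row
    return spatial.distance.cdist(observations, code_book).argmin(axis=1)

def with_vq():
    return vq.vq(observations, code_book)[0]

def with_kdtree():
    return tree.query(observations)[1]

for f in (brute_force, with_vq, with_kdtree):
    print(f.__name__, timeit.timeit(f, number=10))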
I don't know if I understood your question well, but I will try to give an answer.
For each row K in the table, compute the distance of your key from the key in that row:
( (X1-K1)^2 + (X2-K2)^2 + (X3-K3)^2 + (X4-K4)^2 + (X5-K5)^2 + (X6-K6)^2 + (X7-K7)^2 )^0.5
where {X1,X2,X3,X4,X5,X6,X7} is the key and {K1,K2,K3,K4,K5,K6,K7} is the key at row K.
You could make one factor of the key more or less relevant than the others by multiplying it while computing the distance; for example, you could replace (X1-K1)^2 in the formula above with 5*(X1-K1)^2 to make that factor more influential.
Store the distance in one variable and the row number in a second variable.
Do the same with the following rows, and if the new distance is lower than the one you stored, replace the stored distance and row number.
When you have checked all the rows in your table, the second variable will hold the row nearest to the key.
Here is a rough Python version of that idea:
import math

def distance(key, row_key):
    # Euclidean distance between the 7-value key and a row's 7 key columns;
    # multiply a term (e.g. by 5) to make that factor more influential
    return math.sqrt(sum((x - k) ** 2 for x, k in zip(key, row_key)))

key = [0.0] * 7                  # suppose it is already filled with some values
closest_distance = float("inf")
closest_row = 0

# table is assumed to already exist: each row has 7 key columns followed by 4 value columns
for row in range(len(table)):
    new_distance = distance(key, table[row][0:7])
    if new_distance < closest_distance:
        closest_distance = new_distance
        closest_row = row

value_found = table[closest_row][7:11]   # this should be the value you were looking for
I know it isn't fast, but it is the best I could do; I hope it helped.
P.S. I know I haven't considered measurement errors.
