I am having a little bit of trouble with a custom_filters_score. The individual scores 0.81491333, 0.125 and 0.08695652 somehow add up to a total of 0.1727262…
EDIT: I looked over it again and it looks like the total of the custom_filters_score is being multiplied by the 'normal' query.
Is there a way to either incorporate the normal query (custom_score) into the custom_filter_score or, alternatively, a way to force elasticsearch to add the two together (instead of multiplying)?
A gist of the data, query and mapping is at https://gist.github.com/sqwk/3d7b25192a236fba82b4
OK - I've finally figured out what is happening. First, in your gist, your query doesn't match any of your documents because of the lat/lon. I chose one document at random (323) and used the lat/lon values from there.
This is the explanation I get:
- custom score, score mode [total] | 0.8795
  - Score based on score mode Max and child doc range from 10 to 16 | 1.0000
    - Child[16] | 1.0000
  - custom score, product of: | 0.2000
    - match filter: cache(object_max_rooms:[4 TO *]) | 1.0000
    - scriptFactor | 0.2000
    - queryBoost | 1.0000
  - custom score, product of: | 0.5714
    - match filter: cache(object_min_living_area:[* TO 125]) | 1.0000
    - scriptFactor | 0.5714
    - queryBoost | 1.0000
  - custom score, product of: | 0.1081
    - match filter: cache(object_max_living_area:[125 TO *]) | 1.0000
    - scriptFactor | 0.1081
    - queryBoost | 1.0000
As you can see, the lat/lon matches exactly, so the query scores 1, and the scores from the custom_filters_score filters are being totalled up nicely.
Then I changed the lat value from 50.0852386 to 50.0882386, and reran. Now the scores look like this:
- custom score, score mode [total] | 0.7081
  - Score based on score mode Max and child doc range from 10 to 16 | 0.8050
    - Child[16] | 0.8050
  - custom score, product of: | 0.2000
    - match filter: cache(object_max_rooms:[4 TO *]) | 1.0000
    - scriptFactor | 0.2000
    - queryBoost | 1.0000
  - custom score, product of: | 0.5714
    - match filter: cache(object_min_living_area:[* TO 125]) | 1.0000
    - scriptFactor | 0.5714
    - queryBoost | 1.0000
  - custom score, product of: | 0.1081
    - match filter: cache(object_max_living_area:[125 TO *]) | 1.0000
    - scriptFactor | 0.1081
    - queryBoost | 1.0000
So the score from the filters is being combined with the score from the query, and then normalized. This is to be expected. The score_mode only applies to the filters, not the combination of the filters and the query.
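Working through the numbers above: the three filter scores total 0.2000 + 0.5714 + 0.1081 = 0.8795. In the first run the query part scores 1.0000 and the final score is 0.8795 × 1.0000 = 0.8795; in the second run it is 0.8795 × 0.8050 ≈ 0.7081, the value reported at the top. So, at least in these examples, the filter total and the query score are effectively multiplied together.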
If you want the scores to simply be added together, you would need to move the distance calculation out of the query and into a filter under the custom_filters_score filters. The problem there is that the scoring script won't have access to the nested places docs, so you would be unable to do that.
Why is the exact total so important? The _score should never be taken as an absolute value. It just reflects the relative importance of each document. You just need to tweak the impact of each clause until you're getting the "right" order for your requirements.
I know one can use LOOKUP or VLOOKUP to find which range, from a list of contiguous ranges, a value belongs to. I know you can also do some functional but awful things with IF statements to achieve something similar.
What I'm looking to do:
I have a calculated value between 1 and 100% (it represents how far along the lunar synodic cycle the moon is).
I need to essentially identify if the calculated value falls in:
0 - 0.6 Full
11.9 - 13.1 3/4 Waning
24.4 - 25.6 1/2 Waning
36.9 - 38.1 1/4 Waning
49.4 - 50.6 No Moon
61.9 - 63.1 1/4 Waxing
74.4 - 75.6 1/2 Waxing
86.9 - 88.1 3/4 Waxing
99.4 - 100 Full
So I need to check if the calculated value falls within any of the calculated ranges and if so, return the associated text. If it does not, it would be desirable to return a blank ("").
I'm wondering if I have to just use a really ugly nested IF statement, or if there is some graceful way to use one or two lookup functions to accomplish what I want. The fact that the overall range being tested against is sparse (parts of the range should return a blank) is the challenge.
One approach I can see is not using a sparse range: fill in, between each of the existing ranges, a range that returns a blank, then use LOOKUP or VLOOKUP. Is that my best option or is there a better solution?
For example:
0 - 0.6 Full
0.7 - 11.8 <blank>
11.9 - 13.1 3/4 Waning
13.2 - 24.3 <blank>
24.4 - 25.6 1/2 Waning
25.7 - 36.8 <blank>
36.9 - 38.1 1/4 Waning
38.2 - 49.3 <blank>
49.4 - 50.6 No Moon
50.7 - 61.8 <blank>
61.9 - 63.1 1/4 Waxing
63.2 - 74.3 <blank>
74.4 - 75.6 1/2 Waxing
75.7 - 86.8 <blank>
86.9 - 88.1 3/4 Waxing
88.2 - 99.3 <blank>
99.4 - 100 Full
Given your original data, split into three columns and formatted as a table named PhaseTbl, with column headers Min, Max and Phase, I believe the following will do what you require, with the value to be tested in A2:
=IFERROR(INDEX(PhaseTbl,AGGREGATE(15,6,1/((A2>=PhaseTbl[Min])*(A2<=PhaseTbl[Max]))*ROW(PhaseTbl)-ROW(PhaseTbl[#Headers]),1),3),"")
(Screenshots of the phase table and the sample results omitted.)
You can examine how this formula works by using the formula evaluation tool.
In brief, working from the inside out:
(A2>=PhaseTbl[Min])*(A2<=PhaseTbl[Max])
We take our value and, by multiplying the two Boolean comparisons, return an array of 1s and 0s: a 1 wherever, in the same row, the tested value is both greater than or equal to the Min and less than or equal to the Max.
1/{1,0,0,...}
This converts the 0s into error values while leaving the 1s as 1.
The array form of the AGGREGATE function, with the ignore errors argument, will then return the row number of the match. We adjust that for the table location to return the value from column 3 within the INDEX function.
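As a hypothetical walk-through, suppose the value in A2 is 12.5. The Boolean product is 1 only in the 11.9–13.1 row, so 1/(…) keeps that row and turns every other entry into a #DIV/0! error; AGGREGATE(15,6,…,1) ignores the errors and returns that row's position within the table, and INDEX then pulls "3/4 Waning" from column 3. For a value such as 20, which falls inside none of the ranges, the whole array is errors, AGGREGATE itself returns an error, and IFERROR converts that to "".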
In addition to Ron's answer, you can use an array formula:
=IFERROR(INDEX(PhaseTbl[State],MATCH(1,([#Value]>=PhaseTbl[Min])*([#Value]<=PhaseTbl[Max]),0)),"")
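Depending on your version of Excel, this too may need to be confirmed as an array formula with Ctrl+Shift+Enter rather than a plain Enter (versions with dynamic arrays handle it automatically).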
I have another option that uses three basic functions, IF, COUNTIFS and INDEX(MATCH()), with no array formulas. As other answers have suggested, I'd first recommend you split your data into three columns, Min, Max and Phase, so it looks like the following example.
Input Data:
  A   |  B   |  C
 Min  | Max  | Phase
------+------+------------
  0   | 0.6  | Full
 11.9 | 13.1 | 3/4 Waning
 24.4 | 25.6 | 1/2 Waning
 36.9 | 38.1 | 1/4 Waning
 49.4 | 50.6 | No Moon
 61.9 | 63.1 | 1/4 Waxing
 74.4 | 75.6 | 1/2 Waxing
 86.9 | 88.1 | 3/4 Waxing
 99.4 | 100  | Full
With the above data starting in A1, your output data starting in F1 would look like the example below, with the following formula in G2 (copied down).
Formula:
=IF(COUNTIFS(A:A,"<="&F2,B:B,">="&F2)=1,INDEX(C:C,MATCH(F2,A:A,1)),"")
Output Data:
  F    |  G
 Value | Result
-------+------------
  0    | Full
  0.6  | Full
  0.7  |
 11.8  |
 11.9  | 3/4 Waning
 13.1  | 3/4 Waning
 13.2  |
 24.3  |
 24.4  | 1/2 Waning
 25.6  | 1/2 Waning
 25.7  |
 99.3  |
 99.4  | Full
 100   | Full
 110   |
Formula Explained:
=IF(COUNTIFS(A:A,"<="&F2,B:B,">="&F2)=1,INDEX(C:C,MATCH(F2,A:A,1)),"")
Count the rows where the value in the "Min" column is less than or equal to F2 AND the value in the "Max" column on the same row is greater than or equal to F2.
If the count returns 1, return the "Phase" on the same row as the largest "Min" value that is less than or equal to F2 (this is done by the MATCH part of the INDEX(MATCH()) formula, with the match type set to 1, i.e. "less than or equal to").
If the count does not return 1, return a blank cell.
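As a hypothetical walk-through, take F2 = 24.4: exactly one row has Min ≤ 24.4 and Max ≥ 24.4 (the 24.4–25.6 row), so the COUNTIFS is 1; MATCH(24.4, A:A, 1) lands on that row, and INDEX returns "1/2 Waning". For F2 = 24.3, no row satisfies both conditions, the count is 0, and the IF returns a blank.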
I have this Excel table, used as a DB, named "csv":
 Ticket | agent_wait | client_wait
--------+------------+-------------
      1 |        200 |         105
      2 |         10 |          50
      3 |        172 |         324
I'd like to calculate the average of the agent wait ratios, where ratio_agent is calculated as agent_wait / (agent_wait + client_wait).
If the table were like this:
 Ticket | agent_wait | client_wait | ratio_agent
--------+------------+-------------+-------------
      1 |        200 |         105 |        0.65
      2 |         10 |          50 |        0.16
      3 |        172 |         324 |        0.34
I'd just do the average of the ratio_agent column with =AVERAGE(csv[ratio_agent]).
The problem is that this last column does not exist and I don't want to create an additional column just for this calculation.
Is there a way to do this with only a formula?
I already tried
=AVERAGE(csv[agent_wait]/(csv[agent_wait]+csv[client_wait])) but it gives me the answer for only one line.
You can use the formula you already have, but you need to enter it as an array formula. What this means is: after typing the formula, do not just press Enter, but hold Ctrl+Shift and then press Enter. The formula will turn into this after you do that:
{=AVERAGE(csv[agent_wait]/(csv[agent_wait]+csv[client_wait]))}
It will then give you the value you are looking for. Swap the columns (change the first csv[agent_wait] to csv[client_wait]) if you are looking for the average client_wait ratio instead.
It occurs to me that your question might be an XY problem. Please take a read of this answer. It might help you decide what you are actually looking for.
In brief, if you want a measure of how much time:
agents spend waiting out of all the waiting between agents and clients: calculate the totals first, then take the ratio of those totals. Outliers (e.g. a special case where an agent spent far more time waiting on a client than the client did) will heavily affect this measure. Use it if you want to know how much time agents spend waiting as opposed to how much clients wait.
=SUM(csv[agent_wait])/SUM(csv[agent_wait]+csv[client_wait])
agents each spend waiting on any particular call: calculate the ratios first, then take the average of those. Outliers will not affect this measure by much; it gives the expected ratio of time an agent might spend waiting on any interaction with a client. Use it if you want a guideline for how much time an agent should spend waiting for each unit of time a client spends waiting.
=AVERAGE(csv[agent_wait]/(csv[agent_wait]+csv[client_wait]))
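As a quick check using the numbers from the question: the first measure gives 382 / (382 + 479) ≈ 0.444, while the second gives (0.656 + 0.167 + 0.347) / 3 ≈ 0.390, so the two really are answering different questions.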
It also wouldn't be correct to do the =AVERAGE(csv[ratio_agent]) calculation. An average of averages isn't the overall average. You need to sum the parts and then compute the overall ratio from those totals.
Ticket | agent_wait | client_wait | ratio_agent
------ | ---------- | ----------- | -----------
1 | 200 | 105 | 0.656
2 | 10 | 50 | 0.167
3 | 172 | 324 | 0.347
Total | 382 | 479 | ?????
The question is what goes in for the ?????.
If you take the average of the ratio_agent column (i.e. =AVERAGE(csv[ratio_agent])) then you get 0.390.
But if you compute the ratio again, but with the column totals, like =csv[[#Totals],[agent_wait]]/(csv[[#Totals],[agent_wait]]+csv[[#Totals],[client_wait]]), then you get the true answer: 0.444.
To see why this is true, try this set of data:
Ticket | agent_wait | client_wait | ratio_agent
------ | ---------- | ----------- | -----------
1 | 2000 | 2000 | 0.500
2 | 10 | 1 | 0.909
Total | 2010 | 2001 |
The average of the two ratios is 0.705, but it should be clear that if the total agent wait was 2010 and the total client wait was 2001 then the true average ratio must be closer to 0.500.
Computing it using the correct calculation you get 0.501.
I'm running a basic difference-in-differences regression model with year and county fixed effects with the following code:
xtreg ln_murder_rate i.treated##i.after_1980 i.year ln_deprivation ln_foreign_born young_population manufacturing low_skill_sector unemployment ln_median_income [weight = mean_population], fe cluster(fips) robust
i.treated is a dichotomous measure of whether or not a county received the treatment over the lifetime of the study, and after_1980 marks the post-treatment period. However, when I run this regression, the estimate for my treatment variable is omitted, so I can't really interpret the results. Below is the output. I would love some guidance on what to check so that I can get an estimate for the treated variable prior to treatment.
xtreg ln_murder_rate i.treated##i.after_1980 i.year ln_deprivation ln_foreign_bo
> rn young_population manufacturing low_skill_sector unemployment ln_median_income
> [weight = mean_population], fe cluster(fips) robust
(analytic weights assumed)
note: 1.treated omitted because of collinearity
note: 2000.year omitted because of collinearity
Fixed-effects (within) regression Number of obs = 15,221
Group variable: fips Number of groups = 3,117
R-sq: Obs per group:
within = 0.2269 min = 1
between = 0.1093 avg = 4.9
overall = 0.0649 max = 5
F(12,3116) = 89.46
corr(u_i, Xb) = 0.0502 Prob > F = 0.0000
(Std. Err. adjusted for 3,117 clusters in fips)
---------------------------------------------------------------------------------
| Robust
ln_murder_rate | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
1.treated | 0 (omitted)
1.after_1980 | .2012816 .1105839 1.82 0.069 -.0155431 .4181063
|
treated#|
after_1980 |
1 1 | .0469658 .0857318 0.55 0.584 -.1211307 .2150622
|
year |
1970 | .4026329 .0610974 6.59 0.000 .2828376 .5224282
1980 | .6235034 .0839568 7.43 0.000 .4588872 .7881196
1990 | .4040176 .0525122 7.69 0.000 .3010555 .5069797
2000 | 0 (omitted)
|
ln_deprivation | .3500093 .119083 2.94 0.003 .1165202 .5834983
ln_foreign_born | .0179036 .0616842 0.29 0.772 -.1030421 .1388494
young_populat~n | .0030727 .0081619 0.38 0.707 -.0129306 .0190761
manufacturing | -.0242317 .0073166 -3.31 0.001 -.0385776 -.0098858
low_skill_sec~r | -.0084896 .0088702 -0.96 0.339 -.0258816 .0089025
unemployment | .0335105 .027627 1.21 0.225 -.0206585 .0876796
ln_median_inc~e | -.2423776 .1496396 -1.62 0.105 -.5357799 .0510246
_cons | 2.751071 1.53976 1.79 0.074 -.2679753 5.770118
----------------+----------------------------------------------------------------
sigma_u | .71424066
sigma_e | .62213091
rho | .56859936 (fraction of variance due to u_i)
---------------------------------------------------------------------------------
This is borderline off-topic since this is essentially a statistical question.
The variable treated is dropped because it is time-invariant and you are doing a fixed effects regression, which transforms the data by subtracting the average for each panel for each covariate and outcome. Treated observations all have treated set to one, so when you subtract the average of treated for each panel, which is also one, you get a zero. Similarly for control observations, except they all have treated set to zero. The result is that the treated column is all zeros and Stata drops it because otherwise the matrix is not invertible since there is no variation.
The parameter you care about is treated#after_1980, which is the DID effect and is reported in your output. The fact that treated is dropped is not concerning.
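If you want to see this directly in your own data, here is a minimal sketch (assuming, as in your output, that the panel variable is fips and the treatment dummy is treated):
* Demean treated within each panel, as the fixed-effects (within) transformation does.
* For a time-invariant dummy the demeaned variable is zero everywhere.
bysort fips: egen treated_bar = mean(treated)
generate treated_within = treated - treated_bar
summarize treated_within    // mean, min and max should all be 0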
I am using tabstat in Stata, and estpost and esttab to get its output into LaTeX. I use tabstat to display statistics by group. For example,
tabstat assets, by(industry) missing statistics(count mean sd p25 p50 p75)
The question I have is whether there is a way for tabstat (or another Stata command) to display the output ordered by the value of the mean, so that categories with higher means appear at the top. By default, Stata displays the groups in alphabetical order of industry when I use tabstat.
tabstat does not offer such a hook, but there is an approach to problems like this that is general and quite easy to understand.
You don't provide a reproducible example, so we need one:
. sysuse auto, clear
(1978 Automobile Data)
. gen Make = word(make, 1)
. tab Make if foreign
Make | Freq. Percent Cum.
------------+-----------------------------------
Audi | 2 9.09 9.09
BMW | 1 4.55 13.64
Datsun | 4 18.18 31.82
Fiat | 1 4.55 36.36
Honda | 2 9.09 45.45
Mazda | 1 4.55 50.00
Peugeot | 1 4.55 54.55
Renault | 1 4.55 59.09
Subaru | 1 4.55 63.64
Toyota | 3 13.64 77.27
VW | 4 18.18 95.45
Volvo | 1 4.55 100.00
------------+-----------------------------------
Total | 22 100.00
Make here is like your variable industry: it is a string variable, so in tables Stata will tend to show it in alphabetical (alphanumeric) order.
The work-around has several easy steps, some optional.
Calculate a variable on which you want to sort. egen is often useful here.
. egen mean_mpg = mean(mpg), by(Make)
Map those values to a variable with distinct integer values. As two groups could have the same mean (or other summary statistic), make sure you break ties on the original string variable.
. egen group = group(mean_mpg Make)
This variable is created to have value 1 for the group with the lowest mean (or other summary statistic), 2 for the next lowest, and so forth. If the opposite order is desired, as in this question, flip the grouping variable around.
. replace group = -group
(74 real changes made)
There is a problem with this new variable: the values of the original string variable, here Make, are nowhere to be seen. labmask (to be installed from the Stata Journal website; find it with search labmask) is a helper here. We use the values of the original string variable as the value labels of the new variable. (The idea is that the value labels become the "mask" that the integer variable wears.)
. labmask group, values(Make)
Optionally, set the variable label of the new integer variable.
. label var group "Make"
Now we can tabulate using the categories of the new variable.
. tabstat mpg if foreign, s(mean) by(group) format(%2.1f)
Summary for variables: mpg
by categories of: group (Make)
group | mean
--------+----------
Subaru | 35.0
Mazda | 30.0
VW | 28.5
Honda | 26.5
Renault | 26.0
Datsun | 25.8
BMW | 25.0
Toyota | 22.3
Fiat | 21.0
Audi | 20.0
Volvo | 17.0
Peugeot | 14.0
--------+----------
Total | 24.8
-------------------
Note: other strategies are sometimes better or as good here.
If you collapse your data to a new dataset, you can then sort it as you please.
graph bar and graph dot are good at displaying summary statistics over groups, and the sort order can be tuned directly.
UPDATE 3 and 5 October 2021: A new helper command, myaxis, from SSC and the Stata Journal, condenses the example here with tabstat:
* set up data example
sysuse auto, clear
gen Make = word(make, 1)
* sort order variable and tabulation
myaxis Make2 = Make, sort(mean mpg) descending
tabstat mpg if foreign, s(mean) by(Make2) format(%2.1f)
I would look at the egenmore package on SSC. You can get that package by typing ssc install egenmore in Stata. In particular, look at the entry for axis() in the egenmore help file. It contains an example that does exactly what you want.
I have a Prediction cell (A1), a Results cell (B1) and a Difference cell (C1) in MS Excel. I have an Accuracy cell (D1) that I would like to show the accuracy of the prediction based on the result.
For example, if I have 24 as a prediction and 24 as the result, the accuracy should be 100% with a difference of 0. If I have a prediction of 24 and a result of 12, then the accuracy should be 50% with a difference of -12. If I have a prediction of 24 and a result of 48, then the accuracy should be 50% with a difference of 12.
Here is the calculation I'm using in the Accuracy cell (D1):
=(((C1+100)*A1)/B1)/(C1+100)
This calculation is only showing the expected results when the Prediction cell is higher than or equal to the Results cell.
It's a pretty simple formula as far as I can see.
If A1 is the cell with the prediction in it and A2 is the actual result, then use this formula:
=IF(A1<A2,A1/A2,A2/A1)
This gives the percentage of prediction vs actual in either case.
Experimentally, when the following values were used, the formula returned these results:
Prediction: 2 | Actual 10 | Accuracy: 20%
Prediction: 4 | Actual 10 | Accuracy: 40%
Prediction: 6 | Actual 10 | Accuracy: 60%
Prediction: 8 | Actual 10 | Accuracy: 80%
Prediction: 10 | Actual 10 | Accuracy: 100%
Prediction: 12 | Actual 10 | Accuracy: 83%
Prediction: 14 | Actual 10 | Accuracy: 71%
Prediction: 16 | Actual 10 | Accuracy: 63%
Prediction: 18 | Actual 10 | Accuracy: 56%
Prediction: 20 | Actual 10 | Accuracy: 50%
Hope that solves your problem.
Format the cell in Col C as "%" and then try this formula
Formula in C1:
=IF(B1<A1,VALUE("-"&(B1/A1)),(B1/A1))
Formula in D1:
=B1-A1
(Screenshot of various test scenarios omitted.)
Let me know if I have misunderstood your question and I will rectify my answer.
FOLLOWUP
Try this new formula
=IF(B1<A1,VALUE("-"&(B1/A1)),IF(B1=A1,B1/A1,(B1-A1)/A1))
Try this:
=(1 - ((B1 - A1)/A1)) * 100
EDIT:
I just gave it more thought. It looks like the definition of accuracy is quite subjective here. Try the following formula and let me know if it works any better.
=(1-(ABS(B1-A1)/A1))*100
This is how I do it:
=MIN(A1:B1)/MAX(A1:B1)
The formula below will give the accuracy percentage in Excel.
Click on the % button after applying it:
=1-(ABS((Predicted/Actual)-1))