When we calculate the CTR from the data, like this:
ctr = #(clicks) / #(impressions)
if the number of impressions is too small, the calculated CTR comes out higher than it does for ads with larger impression counts.
Is there any way to calculate a normalized CTR for evaluating each ad's performance?
How can I normalize?
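To make the issue concrete, here is a minimal Python sketch with made-up click/impression counts (not real campaign data):

```python
# Made-up counts illustrating why a raw CTR from very few impressions
# looks inflated next to an ad with many impressions.
ads = {
    "ad_small": {"clicks": 2, "impressions": 5},        # hypothetical tiny sample
    "ad_large": {"clicks": 300, "impressions": 10000},  # hypothetical large sample
}

for name, counts in ads.items():
    ctr = counts["clicks"] / counts["impressions"]
    print(f"{name}: ctr = {ctr:.3f}")

# ad_small: ctr = 0.400  -- "wins" only because of the tiny sample
# ad_large: ctr = 0.030
```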
As you know, it is impossible for the number of clicks to be higher than the number of impressions, because in theory every click is accompanied by an impression. Hence ctr <= 1, unless you have a technical problem in counting impressions or clicks.
I am trying to use a circular reference with iterative calculations to find a specific number, but my file will not display it. I turned on iterative calculations, but the value fluctuates wildly even though there is a definitive answer. I have included an image of my example.
In general, circular references are the result of an incorrect structure in formulas. Enabling iterative calculations is not a solution to this problem. They are only intended for situations where repeatedly performing a calculation leads to a stable result.
(An example of useful iterative calculations is field analysis. You can define values around the edge of a field of any shape. In the interior, the value of every point is the average of all the points around it. After several thousand iterations, everything will stabilise to an evenly distributed field.)
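As an aside, here is a rough Python sketch of that kind of converging iteration (not the asker's spreadsheet): every interior cell is repeatedly replaced by the average of its neighbours until the grid settles.

```python
# Rough sketch: iterative averaging over a 5x5 grid with fixed edges.
# The boundary values below are arbitrary; only the convergence matters.
import copy

grid = [[0.0] * 5 for _ in range(5)]
for i in range(5):
    grid[0][i] = 100.0   # top edge held at 100
    grid[4][i] = 0.0     # bottom edge held at 0

for _ in range(1000):                     # enough iterations to stabilise
    new = copy.deepcopy(grid)
    for r in range(1, 4):
        for c in range(1, 4):
            new[r][c] = (grid[r-1][c] + grid[r+1][c] +
                         grid[r][c-1] + grid[r][c+1]) / 4.0
    grid = new

print(round(grid[2][2], 3))   # a stable interior value between 0 and 100
```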
In your situation, iterative calculation does not lead to a stable result. When Ending Cash is -300, the Cash Deposit is set to 300. But then the Ending Cash is 0, so the Cash Deposit is set to 0. But then the Ending Cash is -300, so the Cash Deposit is set to 300, ad infinitum.
The calculation of how much Cash Deposit is required to ensure the End Cash is greater than or equal to 0 can't be based on the End Cash or the Total Deposit, as it will keep contradicting itself as to whether or not a Cash Deposit is required.
Instead, the calculation of how much Cash Deposit is required must look at all other deposits, excluding itself, and compare that to the total withdrawals. (By not referring to the Total Deposits or End Cash, there is no circular reference.) It can be set up as:
=IF(B2+B5<B14,B14-(B2+B5),0)
It adds together all other Deposits excluding itself and compares that to the Total Withdrawals to determine whether any additional deposit is required. If an additional deposit is required, it then subtracts all other Deposits from the Total Withdrawals to find out how much it needs to be.
A more elegant way of writing such a formula is:
=MAX(B14-B5-B2,0)
This simply subtracts the other deposits from the total withdrawals and returns the difference when it is positive (more withdrawals, so a deposit is needed), or 0 when the difference is negative (more deposits, so no extra deposit is needed).
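If it helps to see the same logic outside Excel, here is the =MAX(B14-B5-B2,0) idea sketched in Python; the cell values are made up:

```python
# total_withdrawals plays the role of B14; other_deposits plays the role of B2 and B5.
total_withdrawals = 800.0
other_deposits = [300.0, 200.0]

# Deposit only what the other deposits fail to cover, and never a negative amount.
cash_deposit = max(total_withdrawals - sum(other_deposits), 0.0)
print(cash_deposit)   # 300.0
```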
I have n players to assign to n games, 10 <= n <= 20. Each player can sign up for up to 3 games but will only get one. Different players have different scores for each game they sign up for.
Example with 10 players:
It's always possible to assign player x to game x, but it will not always give the highest total score.
My goal is to get as high a score as possible, and I therefore want to test the different permutations. I could theoretically test all permutations and throw away the infeasible ones, but that gives me a huge number of possibilities (n!).
Is it possible to reduce the problem using the sign-up limit of at most 3 games? Maybe this can be done more easily than my approach? Any thoughts?
I'm working in Excel VBA.
I hope you find this as interesting as I do ...
Sorry if you find this unclear! My question is whether it's possible to generate a subset of all the permutations, more precisely only the feasible ones (the ones without any zero score).
Well, just set this up in the Solver using linear programming, as you can see in the image. I have shown the formulae, along with the Solver settings, so you can build it as well.
It won't give the permutations, but it does solve for the highest-scoring combination.
Edit: updated the image... it now shows the correct ranges for the calculations, after trying to make it fit a reasonable size.
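If you ever want to check the Solver result outside Excel, the same model can be expressed as an assignment problem; here is a rough SciPy sketch with made-up sign-ups and scores (a score of 0 means the player did not sign up for that game):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# score[p][g] = player p's score for game g; 0 marks a game the player did not sign up for.
score = np.array([
    [5, 0, 3, 0],
    [0, 4, 0, 2],
    [6, 1, 0, 0],
    [0, 0, 2, 7],
], dtype=float)

# Make infeasible pairings extremely unattractive so the solver avoids them.
adjusted = np.where(score > 0, score, -1e6)

rows, cols = linear_sum_assignment(adjusted, maximize=True)
for p, g in zip(rows, cols):
    print(f"player {p} -> game {g} (score {score[p, g]:.0f})")
print("total score:", score[rows, cols].sum())
```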
I have a dataset with 9448 data points (rows)
Whenever I choose values of K ranging from 1 to 10, the accuracy comes out to be 100 percent (which is an ideal case, of course!) and weird.
If I choose my K value to be 100 or above, the accuracy decreases gradually (95% to 90%).
How does one choose the value of K? We want decent accuracy, not an implausible 100 percent.
Well, a simple approach to selecting k is sqrt(no. of data points). In this case, that is sqrt(9448) = 97.2 ~ 97. And please keep in mind that it is inappropriate to say which k value suits best without looking at the data. If training samples of similar classes form clusters, then using a k value from 1 to 10 will achieve good accuracy. If the data is randomly distributed, then one cannot say which k value will give the best results. In such cases, you need to find it by performing an empirical analysis.
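A sketch of that empirical analysis in Python, using a synthetic dataset of the same size as a stand-in for yours: try several k values with cross-validation and keep the one that generalises best.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 9448-row dataset; replace with your own X and y.
X, y = make_classification(n_samples=9448, n_features=10, random_state=0)

candidate_ks = [1, 5, 11, 25, 51, 97, 151]   # 97 ~ sqrt(9448), as suggested above
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in candidate_ks}

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```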
I have two groups, "in" and "out," and item categories that can be split up among the groups. For example, I can have item category A that is 99% "in" and 1% "out," and item B that is 98% "in" and 2% "out."
For each of these items, I actually have the counts that are in/out. For example, A could have 99 items in and 1 item out, and B could have 196 items that are in and 4 that are out.
I would like to rank these items based on the percentage that are "in," but I would also like to give some priority to items that have larger overall populations. This is because I would like to focus on items that are very relevant to the "in" group, but still have a large number of items in the "out" group that I could pursue.
Is there some kind of score that could do this?
I'd be tempted to use a probabilistic rank: the probability that an item category belongs to the "in" group given the actual counts for that category. This requires making some assumptions about the data set, including why a category may have any out-of-group items. You might take a look at the binomial test or the Mann-Whitney U test for a start. You might also look at some other kinds of nonparametric statistics.
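For the binomial-test suggestion, a small SciPy sketch; the baseline in-rate here is an assumption you would have to choose from your overall data:

```python
from scipy.stats import binomtest

baseline_in_rate = 0.95   # assumed overall "in" proportion; tune to your data

# How surprising is each category's split if it really followed the baseline rate?
print(binomtest(99, 100, baseline_in_rate, alternative="greater").pvalue)
print(binomtest(196, 200, baseline_in_rate, alternative="greater").pvalue)
```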
I ultimately ended up using Bayesian averaging, which was recommended in this post. The technique is briefly described in this Wikipedia article and more thoroughly described in this post by Evan Miller and this post by Paul Masurel.
In Bayesian averaging, "prior values" are used to pull the numerator and denominator towards their expected values. Essentially, the expected numerator and expected denominator are added to the actual numerator and denominator. When the numerator and denominator are small, the prior values have a larger impact because they represent a larger proportion of the new numerator/denominator. As the numerator and denominator grow in magnitude, the Bayesian average approaches the actual average, reflecting the increased confidence.
In my case, the prior value for the average was fairly low, which biased averages with small denominators downward.
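A minimal sketch of this kind of Bayesian average; the prior rate and prior count below are illustrative assumptions, not the values from the actual data:

```python
def bayesian_average(num_in, total, prior_rate=0.5, prior_count=10):
    """Shrink the raw in-rate towards prior_rate; small samples are pulled hardest."""
    return (num_in + prior_rate * prior_count) / (total + prior_count)

# Category A: 99 in out of 100; category B: 196 in out of 200 (from the question).
print(round(bayesian_average(99, 100), 3))    # 0.945
print(round(bayesian_average(196, 200), 3))   # 0.957 -- the larger category now ranks higher
```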
We have an application where users enter prices all day. These prices are recorded in a table with a timestamp and then used for producing charts of how the price has moved... Every now and then a user enters a price wrongly (e.g. puts in one zero too many or too few), which somewhat ruins the chart (you get big spikes). We've even put in an extra confirmation dialogue if the price moves by more than 20%, but this doesn't stop them entering wrong values...
What statistical method can I use to analyse the values before I chart them to exclude any values that are way different from the rest?
EDIT: To put some meat on the bones: say the prices are share prices (they are not, but they behave in the same way). You could see prices moving significantly up or down during the day. On an average day we record about 150 prices, and sometimes one or two are way wrong. Other times they are all good...
Calculate and track the standard deviation for a while. After you have a decent backlog, you can disregard the outliers by seeing how many standard deviations away they are from the mean. Even better, if you've got the time, you could use the info to do some naive Bayesian classification.
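A minimal sketch of that idea, assuming a backlog of recent prices and a 3-standard-deviation cutoff (both are assumptions to tune):

```python
import statistics

backlog = [100.2, 100.5, 99.8, 101.0, 100.7, 100.4]   # made-up recent prices
mean = statistics.mean(backlog)
sd = statistics.stdev(backlog)

def looks_like_typo(price, n_sigmas=3):
    """Flag a new entry that sits too many standard deviations from the backlog mean."""
    return abs(price - mean) > n_sigmas * sd

print(looks_like_typo(100.9))    # False -- an ordinary move
print(looks_like_typo(1009.0))   # True  -- probably an extra zero
```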
That's a great question but may lead to quite a bit of discussion as the answers could be very varied. It depends on
how much effort are you willing to put into this?
could some values genuinely differ by +/-20% (or whatever test you invent)? If so, will there always be a need for some human intervention?
and to invent a relevant test I'd need to know far more about the subject matter.
That being said, the following are possible alternatives.
A simple test against the previous value (or the mean/mode of the previous 10 or 20 values) would be straightforward to implement.
The next level of complexity would involve some statistical measurement of all values (or the previous x values, or the values from the last 3 months). A normal (Gaussian) distribution would enable you to give each value a degree of certainty as to whether it is a mistake or accurate. This degree of certainty would typically be expressed as a percentage.
See http://en.wikipedia.org/wiki/Normal_distribution and http://en.wikipedia.org/wiki/Gaussian_function; there are adequate links from these pages to help in programming this. Also, depending on the language you're using, there are likely to be functions and/or plugins available to help with it.
A more advanced method would be some sort of learning algorithm that could take other parameters into account (on top of the last x values); it could consider the product type or manufacturer, for instance, or even the time of day or the user who entered the figure. This option seems way over the top for what you need, however; it would require a lot of work to code and to train the learning algorithm.
I think the second option is the right one for you. Using the standard deviation (a lot of languages have a function for this) may be a simpler alternative: you simply measure how far the value deviates from the mean of the previous x values, in units of standard deviation. I'd put the standard-deviation option somewhere between options 1 and 2.
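A rough sketch of that second option, assuming a window of the previous x prices; it turns the deviation into a percentage "suspicion" score using the normal distribution:

```python
import math
import statistics

def suspicion_percent(history, new_price):
    """Two-sided normal tail probability turned into a 'likely mistake' percentage."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    z = abs(new_price - mean) / sd
    tail = math.erfc(z / math.sqrt(2))   # P(|Z| >= z) under a normal distribution
    return (1.0 - tail) * 100.0

recent = [100.2, 100.5, 99.8, 101.0, 100.7, 100.4, 100.9, 100.1]   # last x prices, made up
print(round(suspicion_percent(recent, 100.6), 1))    # low-ish: a plausible price
print(round(suspicion_percent(recent, 1010.0), 1))   # ~100.0: almost certainly a typo
```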
You could measure the standard deviation in your existing population and exclude those that are greater than 1 or 2 standard deviations from the mean?
It's going to depend on what your data looks like to give a more precise answer...
Or graph a moving average of prices instead of the actual prices.
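A tiny sketch of the moving-average suggestion with pandas; the window size is an assumption, and a rolling median (also shown) suppresses a lone typo even more strongly:

```python
import pandas as pd

prices = pd.Series([100.2, 100.5, 99.8, 1001.0, 100.7, 100.4, 100.9, 100.1])  # one typo

print(prices.rolling(window=5, min_periods=1).mean().tolist())     # spike is damped but visible
print(prices.rolling(window=5, min_periods=1).median().tolist())   # spike almost disappears
```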
Quoting from here:
Statisticians have devised several methods for detecting outliers. All the methods first quantify how far the outlier is from the other values. This can be the difference between the outlier and the mean of all points, the difference between the outlier and the mean of the remaining values, or the difference between the outlier and the next closest value. Next, standardize this value by dividing by some measure of scatter, such as the SD of all values, the SD of the remaining values, or the range of the data. Finally, compute a P value answering this question: If all the values were really sampled from a Gaussian population, what is the chance of randomly obtaining an outlier so far from the other values? If the P value is small, you conclude that the deviation of the outlier from the other values is statistically significant.
Google is your friend, you know. ;)
For your specific question of plotting, and your specific scenario of an average of 1-2 errors per day out of 150, the simplest thing might be to plot trimmed means, or the range of the middle 95% of values, or something like that. It really depends on what value you want out of the plot.
If you are really concerned with the true max and true min of a day's prices, then you have to treat the outliers as outliers and properly exclude them, probably using one of the outlier tests previously proposed (a data point is x% more than the next point or the last n points, or more than 5 standard deviations away from the daily mean). Another approach is to look at what happens after the suspected outlier: if it really is an outlier, it will show a sharp upturn followed by a sharp downturn.
If, however, you care about the overall trend, plotting the daily trimmed mean, median, and 5th and 95th percentiles will portray the history well.
Choose your display methods and how much outlier detection you need to do based on the analysis question. If you care about medians or percentiles, the outliers are probably irrelevant.
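A short sketch of the trimmed-mean and percentile summaries mentioned above, on roughly 150 made-up prices for one day with two typos mixed in:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.normal(100.0, 0.5, size=150)   # stand-in for a day's entries
prices[40] = 1002.0    # an extra zero typed
prices[90] = 10.1      # a digit dropped

lo, hi = np.percentile(prices, [5, 95])
middle = prices[(prices >= lo) & (prices <= hi)]

print(np.median(prices))   # robust to the two bad points
print(middle.mean())       # trimmed mean of the middle 90% of values
print(lo, hi)              # a band to plot instead of the raw extremes
```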