Calculating "Reliability" of Cricket Stats - statistics

I'm not a statistician so please forgive the (mis)use of terminology.
I am calculating strike rates for batsmen in cricket. For non-cricket fans, this is the number of runs scored (broadly the same as points in other sports) per 100 balls faced.
So if a batsman has faced 100 balls in his career and scored 150 runs, his strike rate would be 150 (runs/balls*100).
I now want to calculate how likely it is that the stat is an accurate representation of the batsman's ability.
The more balls batsmen have faced the more likely it is that the resulting stat is accurate but how do I calculate how reliable it is?
Any help would be appreciated.
Thanks

You have a point estimate, and a confidence interval can help you quantify your uncertainty. For your example of 150 runs in 100 balls, do you know the number of runs scored on each ball? If so, you can definitely create a confidence interval using the standard formula and choose your level of confidence.
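E.g. X bar +/- t_{99, 1-alpha/2} s/sqrt(n) is a 1-alpha level confidence interval for the average runs per ball, where n = 100 is the number of balls (hence 99 degrees of freedom), X bar is the mean runs per ball and s is the sample standard deviation of runs per ball.
Multiplying by 100 gives a CI for the average runs per 100 balls, i.e. the strike rate.
Unfortunately, if you have no other information than the aggregate 150 runs in 100 balls, there is not much you can do.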
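If ball-by-ball data is available, the interval can be sketched in Python roughly as follows. This is only an illustration: the function name `strike_rate_ci` and the sample innings are made up, and it swaps the t quantile for the normal quantile, which is a close approximation only when many balls have been faced.

```python
import math
from statistics import NormalDist, mean, stdev

def strike_rate_ci(runs_per_ball, confidence=0.95):
    """Return (strike_rate, lower, upper) from a list of runs scored on each ball.

    Assumption: balls are treated as independent, and the normal quantile
    stands in for the t quantile (reasonable for roughly 100+ balls).
    """
    n = len(runs_per_ball)
    x_bar = mean(runs_per_ball)          # average runs per ball
    s = stdev(runs_per_ball)             # sample standard deviation per ball
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * s / math.sqrt(n)
    # Multiply by 100 to move from runs per ball to runs per 100 balls.
    return 100 * x_bar, 100 * (x_bar - half_width), 100 * (x_bar + half_width)

# Hypothetical innings: 100 balls, 150 runs in total.
balls = [0] * 41 + [1] * 30 + [2] * 6 + [4] * 15 + [6] * 8
rate, lower, upper = strike_rate_ci(balls)
print(f"strike rate {rate:.1f}, 95% CI ({lower:.1f}, {upper:.1f})")
```

The interval narrows in proportion to 1/sqrt(n), which is the sense in which more balls faced makes the stat more reliable.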

Related

Weighted Average of Two Complier Average Treatment Effects

So I'm taking the weighted average of two complier average treatment effects (CATEs) for an assignment, but I'm not sure how to apportion the appropriate weights. Let me explain why I'm taking this average.
I am given data from a fictional randomized experiment testing the effects of get-out-the-vote efforts on turnout in urban and non-urban areas. Approximately half of the sample lives in urban areas and half in non-urban areas, but they were not randomly assigned to the treatment and control groups. That is, the treatment group is about 80% non-urban (the rest urban) and the control group is 80% urban (the rest non-urban). This creates a confounder because, everything else being equal, urbanites were less likely to vote than non-urbanites (at least in the fictional data).
I am being asked to estimate an overall complier average treatment effect (CATE) for get-out-the-vote interventions while accounting for this confounder. To do this, I found a separate CATE for the urban and non-urban parts of the sample, and I need to find an overall estimate from the two CATEs by taking a weighted average of them.
However, I'm not sure how to assign the appropriate weights. My professor has told us to assign more weight to the group that has more variation in the treatment. Since 80% of the treatment group is non-urban, should I assign a weight of .8 to the non-urban CATE and .2 to the urban one? (i.e., overall CATE = (.8)non-urban CATE + (.2)urban CATE)
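One possible reading of "more variation in the treatment" is to weight each group by its size times the variance of the treatment indicator inside that group, n * p * (1 - p), where p is the share of the group that was treated. The sketch below only illustrates that reading: the counts, the CATE values and the function names are hypothetical and are not taken from GOTV_Experiment.csv.

```python
def variance_weights(strata):
    """strata maps a group name to (n_treated, n_control).

    Returns weights proportional to n * p * (1 - p), where p is the
    within-group share treated. This is an assumed interpretation of
    "variation in the treatment", not a confirmed course formula.
    """
    raw = {}
    for name, (n_treated, n_control) in strata.items():
        n = n_treated + n_control
        p = n_treated / n
        raw[name] = n * p * (1 - p)
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}

def pooled_effect(effects, weights):
    """Weighted average of per-group effects."""
    return sum(weights[name] * effects[name] for name in effects)

# Hypothetical counts: treatment group 80% non-urban, control group 80% urban.
counts = {"non_urban": (400, 100), "urban": (100, 400)}
weights = variance_weights(counts)
# Hypothetical per-group CATEs, for illustration only.
cates = {"non_urban": 0.10, "urban": 0.06}
print(weights, pooled_effect(cates, weights))
```

With these made-up counts the share treated is 0.8 in one group and 0.2 in the other, and p * (1 - p) is 0.16 in both, so this reading would give equal weights rather than .8 and .2.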
For background, the data can be found here: https://press.princeton.edu/student-resources/thinking-clearly-with-data. It's the "GOTV_Experiment.csv" data. Thanks in advance for your help!

Actuarial vs. predicted survival comparison

I have a set of patients and their actuarial 1- and 5-year survival. I have also used their data with a certain commonly utilised medical score that calculates survival probability at 1 and 5 years (for example 75% and 55% respectively). I'd like to compare both survival rates.
I calculated the mean predicted survival for all patients at 1 and 5 years as the mean of the predicted survival probabilities. I then calculated the mean actuarial survival by using 100% if alive at 1 year and 0% if dead at 5 years. I then compared the means of both groups with a t-test.
I have a feeling that what I am doing is grossly incorrect and goes against all rules of statistics, however I have not found any solution to my problem anywhere. Maybe someone can help me? R packages and code are welcome.
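One common alternative to a t-test here is to compare the predictions with the outcomes patient by patient, for instance mean predicted versus observed survival plus the Brier score (the mean squared difference between predicted probability and the 0/1 outcome). A rough sketch in Python rather than R, assuming every patient's status at the time point is known (no censoring); the function name and the data are made up:

```python
def compare_predicted_observed(predicted, alive):
    """predicted: survival probabilities in [0, 1]; alive: 1 if alive, 0 if dead.

    Returns (mean predicted survival, observed survival proportion, Brier score).
    Assumption: no censored patients at the chosen time point.
    """
    n = len(predicted)
    mean_predicted = sum(predicted) / n
    observed = sum(alive) / n
    # Brier score: 0 is perfect; lower is better.
    brier = sum((p - y) ** 2 for p, y in zip(predicted, alive)) / n
    return mean_predicted, observed, brier

# Hypothetical 1-year data for five patients.
pred_1y = [0.90, 0.80, 0.75, 0.60, 0.40]
alive_1y = [1, 1, 1, 0, 0]
print(compare_predicted_observed(pred_1y, alive_1y))
```

If some patients are censored before 1 or 5 years, this simple comparison no longer applies as written and a survival-analysis approach would be needed instead.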

How to find the maximum and lowest value of a random normal or log-normal distribution?

This is my first question on Stack Overflow so forgive me if I'm not in conformity with some norms. That being said, this is my problem:
Edited:
I have a continuous variable where I can only measure some data points, and I need to assess the probability curve for the highest and lowest values between each pair of data points. I have the std deviation and the variable follows a lognormal distribution, which means the average is a log-mean and the std deviation is multiplicative.
Example:
Assuming a car's speed is normally distributed and there are no traffic laws, at 10 AM the car is travelling at 40 MPH and at 11 AM it is travelling at 60 MPH; the standard deviation is a 10% change in its speed every hour. There is a 1h blackout in between where you have no information, but you should be able to estimate the most probable highest speed the car reached in this time, the most probable lowest speed, and somehow a probability distribution of everything in between. You can even assume it's the least likely case that its speed at 10 AM was its lowest speed and its speed at 11 AM was its highest speed in the period (if the car's speed is truly random at every scale, you can even assume it borders on the impossible). The outcome is a lognormal distribution which could be used to simulate scenarios regarding that car.
I'm not an expert in statistics and I understand only the basics and some theory, how should I address this problem?
I'm using Python 3.x, in case you know a way to address this problem there.
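One way to explore this numerically is to simulate paths that are pinned to the two known readings and record each path's highest and lowest value. The sketch below assumes the log of the speed behaves like a Brownian bridge between the readings and that "10% per hour" means a standard deviation of 0.10 on the log scale; both are modelling assumptions, and the function name is made up.

```python
import math
import random

def simulate_extremes(start, end, sigma, steps=60, n_paths=2000, seed=0):
    """Simulate log-scale Brownian bridges from `start` to `end` over one
    time unit and return (maxima, minima), one pair of values per path.

    Assumption: sigma is the per-unit-time standard deviation of log(value).
    """
    rng = random.Random(seed)
    a, b = math.log(start), math.log(end)
    dt = 1.0 / steps
    maxima, minima = [], []
    for _ in range(n_paths):
        # Unconstrained random walk in log space.
        w = [0.0]
        for _ in range(steps):
            w.append(w[-1] + rng.gauss(0.0, sigma * math.sqrt(dt)))
        # Pin the walk to both endpoints: B(t) = a + (b - a) t + W(t) - t W(1).
        path = [start]
        for i in range(1, steps):
            t = i * dt
            path.append(math.exp(a + (b - a) * t + (w[i] - t * w[-1])))
        path.append(end)
        maxima.append(max(path))
        minima.append(min(path))
    return maxima, minima

maxima, minima = simulate_extremes(40.0, 60.0, sigma=0.10)
maxima.sort()
minima.sort()
print("median highest speed:", maxima[len(maxima) // 2])
print("median lowest speed:", minima[len(minima) // 2])
```

The sorted `maxima` and `minima` lists are empirical distributions, so any quantile of the highest or lowest speed can be read off them. A finer `steps` value catches more of the excursions between grid points.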

What is the probability of more than 100 people arriving at the station, if they come based on exponential distribution with 2 mins?

So, I got this problem:
"You have people arriving at the bus station based on an exponential distribution.
You know that the mean of the distribution is 2 mins.
What's the probability that more than 100 people will arrive in 3 hours?"
So I figured out that the problem is that we have to calculate the probability of the actual mean being under 1.8 mins.
But I don't really know how to solve this.
Is it something with confidence intervals?
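So basically the rate of arrival needed to get 100 customers in 3 hrs is one customer per 1.8 min (180/100). Using the cumulative distribution function of the exponential distribution, F(t) = 1 - e^(-λt):
Here λ = 0.5 per minute (1 / mean of 2 mins) and t = 1.8. As we are looking for more than 100 customers within 3 hrs, the integral will be from 0 to 1.8.
This gives 1-e^(-0.5*1.8) = 0.5934. Note that this is the probability that a single inter-arrival time is under 1.8 min; the number of arrivals in 3 hours is itself Poisson-distributed with mean 180/2 = 90.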
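For the count of arrivals in a fixed window, exponential inter-arrival times with mean 2 min correspond to a Poisson number of arrivals with mean 180/2 = 90 over 3 hours. A minimal sketch of P(more than 100 arrivals) on that basis, offered as a cross-check; the function name is made up:

```python
import math

def prob_more_than(k, mean_gap_min, window_min):
    """P(N > k) where N ~ Poisson(window_min / mean_gap_min).

    Assumption: arrivals form a Poisson process, i.e. independent
    exponential gaps with the given mean.
    """
    mu = window_min / mean_gap_min
    term = math.exp(-mu)      # P(N = 0)
    cumulative = term
    for i in range(1, k + 1):
        term *= mu / i        # P(N = i) from P(N = i - 1)
        cumulative += term
    return 1.0 - cumulative

print(prob_more_than(100, mean_gap_min=2.0, window_min=180.0))
```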
You can refer to this link for the theory and a few examples.

Verify transmit power to be within certain limits of its expected value over 95% of test measurements

I have a requirement where I have to verify the transmit power out of a device as measured at its connector is within 2 dB of its expected value over 95% of test measurements.
I am using a signal analyzer to analyze the transmitted power. I only get the average power value, min, max and stdDev of the measurements and not the individual power measurements.
Now, the question is how I would verify the "95% thing" using only the average power, min, max and stdDev. It seems that I can use a normal distribution to find the 95% confidence level.
I would appreciate it if someone could help me with this.
Thanks in anticipation
The way I'm reading this, it seems you are a statistical beginner, so if I'm wrong there, the rest of this answer will probably be insultingly basic, and I'm sorry.
Anyway, the idea is that if a dataset is normally distributed, and all the observations are independent of one another, then 95% of the data points will fall within 1.96 standard deviations of the mean.
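Under that normality assumption, the share of readings expected to fall inside the tolerance band can be estimated from the reported mean and standard deviation alone. A rough sketch; the function name and the numbers are made up, and it assumes independent, normally distributed measurements in dB:

```python
from statistics import NormalDist

def fraction_within(mean_db, sd_db, expected_db, tol_db=2.0):
    """Estimated fraction of measurements within expected_db +/- tol_db.

    Assumption: measurements are independent and normally distributed
    with the reported mean and standard deviation (sd_db must be > 0).
    """
    dist = NormalDist(mean_db, sd_db)
    return dist.cdf(expected_db + tol_db) - dist.cdf(expected_db - tol_db)

# Hypothetical analyzer summary: mean 20.3 dBm, stdDev 0.8 dB, expected 20.0 dBm.
share = fraction_within(20.3, 0.8, 20.0)
print(f"estimated share within 2 dB: {share:.3f}", "PASS" if share >= 0.95 else "FAIL")
```

The reported min and max give a quick sanity check on top of this: if both already lie inside the band, every measurement in that run was within limits.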
Do you get identical estimates of average power every time you measure, or are there some slight random differences from reading to reading? My guess is that it's the second. If you were to measure the power a whole bunch of times, and each time you plotted your average power value on a histogram, then that histogram of sample means would have the shape of a bell curve. This bell curve of sample means would have its own mean and standard deviation, and if you have thousands or millions of data points going into the calculation of each average power reading, it's not horrible to assume that it is a normal distribution. The explanation for this phenomenon is known as the 'central limit theorem', and I recommend both Khan Academy's presentation of it and the Wikipedia page on it.
On the other hand, if your average power is the mean of some small number of data points, like for instance n = 5 or n = 30, then the assumption of a normal distribution of sample means can be pretty bad. In this case, your 95% confidence interval around the average power goes from qt(0.975,n-1)*SD/sqrt(n) below the average to qt(0.975,n-1)*SD/sqrt(n) above the average, where qt(0.975,n-1) is the 97.5th percentile of the t distribution with n-1 degrees of freedom, and SD is your measured standard deviation.
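The small-sample interval can be sketched in Python without extra packages. The t quantile below is found by numerically integrating the t density and bisecting, as a stand-in for R's qt; the helper names and the example numbers are made up.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|], using symmetry about zero."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3
    return 0.5 + half if x >= 0 else 0.5 - half

def t_quantile(p, df):
    """Quantile by bisection; assumes 0.5 <= p < 1 and a quantile below 200."""
    lo, hi = 0.0, 200.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def mean_ci(mean, sd, n, confidence=0.95):
    """Confidence interval for the mean from summary statistics."""
    q = t_quantile(1 - (1 - confidence) / 2, n - 1)
    half_width = q * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Hypothetical summary: average 10.0 dBm, SD 0.5 dB, n = 5 readings.
print(mean_ci(10.0, 0.5, 5))
```

Keep in mind that this interval describes uncertainty in the average power, which is a different question from whether 95% of individual measurements fall within 2 dB.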

Resources