Performing T-Test on Time Series - statistics

My boss asked me to perform a T-Test to test the significance for a certain metric we use called conversion rate.
I have collected 18 months' worth of data for this metric, dating April 1, 2017 - September 30, 2018.
He initially told me to collect 12 - 14 months of the data and run a t-test to look for significance of the metric. (Higher conversion rate means better!)
I'm not really sure how to go about it. Do I split the data up into 9-month samples, i.e. Sample 1: April 2017 - December 2017, Sample 2: January 2018 - September 2018, and run a two-sample t-test? Or would it make sense to compare all of the data against a mean like 0?
Is there a better approach to this? The bottom line is he wants to see that the conversion rate has significantly increased over time.
Thanks,
- Keith

My advice is to dump the t-test and look only at the magnitude of the change in the conversion rate. After all, the conversion rate is what's important to your business. By the way, looking at the magnitude of something practically relevant is called "effect size analysis"; a web search for that should turn up a lot of resources. To get started, just make a plot of the available data -- is conversion rate going up or going down or what?
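If it helps, here is a minimal sketch of that starting point in Python; the file name and column names are invented, so substitute whatever your data actually looks like:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical layout: one conversion-rate value per month in a CSV.
    df = pd.read_csv("conversion_rate.csv", parse_dates=["month"]).set_index("month")

    # Plot the series first -- is it trending up, down, or flat?
    df["conversion_rate"].plot(marker="o")
    plt.ylabel("conversion rate")
    plt.show()

    # A simple effect-size view: the magnitude of the change over the period.
    change = df["conversion_rate"].iloc[-1] - df["conversion_rate"].iloc[0]
    print(f"Change from first to last month: {change:.4f}")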
Further questions should be directed to stats.stackexchange.com instead of SO. Good luck and have fun.

Related

How to set a benchmark based on data

I have a huge dataset of people's daily productivity for X period of time, let's say from Jan 2022 to Aug 2022. I have to identify a standard productivity level that all of them should achieve, one that is nominal and achievable based on the data. What test would you use, and how would you identify the standard benchmark?

Mobile Data Analysis in Excel

I collected mobile data consumption using DATA USAGE on Android. With the data spread over the days of the week (Monday to Sunday), I want to analyse two apps, Facebook and Messenger, to check whether there was a significant difference in data usage depending on the day of the week. Should I be using a t-test or some other method? What's the best method that can be used in Excel to analyse this?
P.S. Help will be much appreciated. Thanks
If you believe your data is normally distributed then, statistically speaking, it sounds like you're going to want to use the t-test. You don't know the population's standard deviation, so that would be my choice. However, the data should be collected over at least 30 weeks if you want the figures for each weekday to be somewhat reliable.
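In Excel itself the built-in T.TEST worksheet function will do the comparison; outside Excel, the same test is a few lines of Python. A rough sketch with invented numbers, using Welch's two-sample t-test (which doesn't assume equal variances):

    from scipy import stats

    # Invented daily usage figures in MB for one app on two different weekdays.
    monday_usage = [120, 95, 140, 110, 130, 105, 125]
    sunday_usage = [200, 180, 220, 210, 190, 230, 205]

    # Welch's t-test: is the mean usage on the two days significantly different?
    t_stat, p_value = stats.ttest_ind(monday_usage, sunday_usage, equal_var=False)
    print(t_stat, p_value)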

How to calculate exact anniversary of sinking of the Titanic

Yes, this is an iOS programming question.
I need to calculate the exact time the RMS Titanic sank for its 100 yr anniversary.
It sank on 15th April 1912 at 2:20 a.m.
Stupid question, you say. The 100th anniversary is
15th April 1912 2:20 a.m.
+100 years
15th April 2012 2:20 a.m.
But I want an alarm to go off in any timezone in the world at exactly the right moment, so I need to handle timezones, and things like British Summer Time being one hour ahead yet not coming into effect until 1916... but as of today, 25 Mar 2012, London IS in BST, so one hour ahead.
I'm confused about timezones. We have GMT and UTC. The ship sank at 49° 56' 49" W, 41° 43' 32" N, so a few hours behind London.
-49.94822196927015
41.72713826043066
What's the correct way to enter a historic date into NSCalendar
and to add 100 years to it exactly and get back the right time
in the user's current timezone?
I notice there are Japanese and Islamic calendar formats in the NSCalendar options. Can iOS devices change their dates to these calendars?
And if this is the case, how would I convert from Gregorian to, say, Islamic?
nice brain puzzler to start your week :)
(There's no code here, as I've never used the iOS API. However, I have some experience of date/time APIs and the oddities they throw up, so I hope you find this answer useful anyway.)
What's the correct way to enter a historic date into NSCalendar
and to add 100 years to it exactly and get back the right time
in the user's current timezone?
It really depends on what you mean by "100 years" - it's not like that's a fixed amount of time really.
I would take the UTC instant at which it sank, apply the user's local time zone, and then add a hundred years to that. However, you then need to consider that the result may not actually be a valid local time in that time zone.
For example, suppose you're in a time zone where at the instant of the sinking, it was 1:20am... but in that time zone, 15th April 2012 is when the clocks change - and they skip from 1am to 2am. In that case, 1:20am never occurs... so you could potentially pick 12:20am, the instant before the DST transition, the instant of the DST transition, or 2:20am, depending on what you think is appropriate.
Another possibility to consider is the opposite - suppose it's a DST transition which goes from 2am to 1am... so 1:20am actually occurs twice. What would you want to do in that case? You probably shouldn't make your app celebrate the anniversary twice!
Another option which removes this possibility of ambiguity and discrepancy is to work out what the offset from UTC in the user's time zone was at the exact time of the sinking, then add 100 years to the UTC value (which will never have any DST transitions) and apply the same offset again.
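I can't show the NSCalendar calls, but as an illustration of that last offset-based approach, here is a rough sketch in Python (the 05:20 UTC figure for the sinking is an assumed value for the example, not a researched one):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    # The sinking as a UTC instant (ship's clock roughly UTC-3, so 2:20 a.m.
    # ship's time is taken as about 05:20 UTC here -- purely an assumption).
    sinking_utc = datetime(1912, 4, 15, 5, 20, tzinfo=timezone.utc)

    # Work out the user's UTC offset at the moment of the sinking...
    user_zone = ZoneInfo("Europe/London")
    offset_then = sinking_utc.astimezone(user_zone).utcoffset()

    # ...add 100 years in UTC (no DST transitions there), then re-apply that offset.
    anniversary_utc = sinking_utc.replace(year=sinking_utc.year + 100)
    anniversary_wall = (anniversary_utc + offset_then).replace(tzinfo=None)
    print(anniversary_wall)   # local wall-clock time to schedule the alarm for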
I notice there are Japanese and Islamic calendar formats in the NSCalendar options. Can iOS devices change their dates to these calendars? And if this is the case, how would I convert from Gregorian to, say, Islamic?
I don't know on that front, I'm afraid.

How to predict when next event occurs based on previous events? [closed]

Closed. This question is off-topic and is not currently accepting answers. Closed 11 years ago.
Basically, I have a reasonably large list (a year's worth of data) of times that a single discrete event occurred (for my current project, a list of times that someone printed something). Based on this list, I would like to construct a statistical model of some sort that will predict the most likely time for the next event (the next print job) given all of the previous event times.
I've already read this, but the responses don't exactly help out with what I have in mind for my project. I did some additional research and found that a Hidden Markov Model would likely allow me to do so accurately, but I can't find a link on how to generate a Hidden Markov Model using just a list of times. I also found that using a Kalman filter on the list may be useful but basically, I'd like to get some more information about it from someone who's actually used them and knows their limitations and requirements before just trying something and hoping it works.
Thanks a bunch!
EDIT: So by Amit's suggestion in the comments, I also posted this to the Statistics StackExchange, CrossValidated. If you do know what I should do, please post either here or there
I'll admit it, I'm not a statistics kind of guy. But I've run into these kinds of problems before. Really what we're talking about here is that you have some observed, discrete events and you want to figure out how likely it is you'll see them occur at any given point in time. The issue you've got is that you want to take discrete data and make continuous data out of it.
The term that comes to mind is density estimation. Specifically kernel density estimation. You can get some of the effects of kernel density estimation by simple binning (e.g. count the number of events in a time interval such as every quarter hour or hour). Kernel density estimation just has some nicer statistical properties than simple binning. (The produced data is often 'smoother'.)
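A minimal sketch of both ideas -- plain binning and a kernel density estimate -- over invented event times (hours since the start of a week):

    import numpy as np
    from scipy.stats import gaussian_kde

    # Invented event times, expressed as hours since the start of the week (0..168).
    event_hours = np.array([9.1, 9.4, 11.0, 33.2, 33.5, 57.8, 58.0, 105.3, 105.9])

    # Simple binning: count events in hour-long intervals.
    counts, edges = np.histogram(event_hours, bins=np.arange(0, 169))

    # Kernel density estimation: a smoother estimate of the same thing.
    kde = gaussian_kde(event_hours)
    grid = np.linspace(0, 168, 1000)
    density = kde(grid)   # relative likelihood of an event at each time of the week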
That only takes care of one of your problems, though. The next problem is still the far more interesting one -- how do you take a time line of data (in this case, only printer data) and produce a prediction from it? First things first -- the way you've set up the problem may not be what you're looking for. While the miracle idea of having a limited source of data and predicting the next step of that source sounds attractive, it's far more practical to integrate more data sources to create an actual prediction. (e.g. maybe the printers get hit hard just after there's a lot of phone activity -- something that can be very hard to predict in some companies.) The Netflix Challenge is a rather potent example of this point.
Of course, the problem with more data sources is that there's extra legwork to set up the systems that collect the data then.
Honestly, I'd consider this a domain-specific problem and take two approaches: Find time-independent patterns, and find time-dependent patterns.
An example time-dependent pattern would be that every weekday at 4:30 Suzy prints out her end-of-day report. This happens at specific times on specific days, so it's extremely simple to detect with predetermined intervals (every day, every weekday, every weekend day, every Tuesday, every 1st of the month, etc.) -- just create a curve of the estimated probability density function that's one week long, go back in time, and average the curves (possibly a weighted average via a windowing function for better predictions).
If you want to get more sophisticated, find a way to automate the detection of such intervals. (Likely the data wouldn't be so overwhelming that you could just brute force this.)
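As a rough illustration of that week-long curve, here is a sketch in Python; the timestamps and the half-life weighting are invented for the example:

    import numpy as np
    from datetime import datetime

    # Invented print timestamps collected over a few weeks.
    events = [datetime(2011, 3, 7, 16, 30), datetime(2011, 3, 8, 9, 5),
              datetime(2011, 3, 14, 16, 31), datetime(2011, 3, 21, 16, 29)]

    start = min(events)
    n_weeks = (max(events) - start).days // 7 + 1
    profile = np.zeros(7 * 24)       # one bin per hour of the week
    half_life = 4                    # weeks; an arbitrary choice of weighting window

    for t in events:
        week_age = (n_weeks - 1) - (t - start).days // 7   # 0 = most recent week
        weight = 0.5 ** (week_age / half_life)             # recent weeks count more
        profile[t.weekday() * 24 + t.hour] += weight

    profile /= profile.sum()   # estimated probability of a job in each hour of the week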
An example time-independent pattern is that every time Mike in accounting prints out an invoice list sheet, he goes over to Johnathan, who prints out a rather large batch of complete invoice reports a few hours later. This kind of thing is harder to detect because it's more free-form. I recommend looking at various intervals of time (e.g. 30 seconds, 40 seconds, 50 seconds, 1 minute, 1.2 minutes, 1.5 minutes, 1.7 minutes, 2 minutes, 3 minutes, .... 1 hour, 2 hours, 3 hours, ....) and subsampling them in a nice way (e.g. Lanczos resampling) to create a vector. Then use a vector-quantization style algorithm to categorize the "interesting" patterns. You'll need to think carefully about how you'll deal with the certainty of the categories, though -- if a resulting category has very little data in it, it probably isn't reliable. (Some vector quantization algorithms are better at this than others.)
Then, to create a prediction as to the likelihood of printing something in the future, look up the most recent activity intervals (30 seconds, 40 seconds, 50 seconds, 1 minute, and all the other intervals) via vector quantization and weight the outcomes based on their certainty to create a weighted average of predictions.
You'll want to find a good way to measure certainty of the time-dependent and time-independent outputs to create a final estimate.
This sort of thing is typical of predictive data compression schemes. I recommend you take a look at PAQ since it's got a lot of the concepts I've gone over here and can provide some very interesting insight. The source code is even available along with excellent documentation on the algorithms used.
You may want to take an entirely different approach from vector quantization, discretize the data, and use something more like a PPM scheme. It can be much simpler to implement and still be effective.
I don't know what the time frame or scope of this project is, but this sort of thing can always be taken to the N-th degree. If it's got a deadline, I'd emphasize getting something working first, and then making it work well. Something not optimal is better than nothing.
This kind of project is cool. This kind of project can get you a job if you wrap it up right. I'd recommend you take your time, do it right, and post it up as functional, open-source, useful software. I highly recommend open source since you'll want to build a community that can contribute data-source providers for more environments than you have access to, the will to support, or the time to support.
Best of luck!
I really don't see how a Markov model would be useful here. Markov models are typically employed when the event you're predicting is dependent on previous events. The canonical example, of course, is text, where a good Markov model can do a surprisingly good job of guessing what the next character or word will be.
But is there a pattern to when a user might print the next thing? That is, do you see a regular pattern of time between jobs? If so, then a Markov model will work. If not, then the Markov model will be a random guess.
As for how to model it, think of the different time periods between jobs as letters in an alphabet. In fact, you could assign each time period a letter, something like:
A - 1 to 2 minutes
B - 2 to 5 minutes
C - 5 to 10 minutes
etc.
Then, go through the data and assign a letter to each time period between print jobs. When you're done, you have a text representation of your data that you can run through any of the Markov examples that do text prediction.
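A toy version of that idea in Python; the gaps and the letter boundaries are made up, following the alphabet above:

    from collections import Counter, defaultdict

    # Made-up gaps between consecutive print jobs, in minutes.
    gaps = [1.5, 3.0, 7.2, 1.8, 3.1, 7.5, 1.6, 3.4]

    def to_letter(gap):
        # Discretize a gap into the alphabet sketched above.
        if gap < 2:  return 'A'   # 1 to 2 minutes
        if gap < 5:  return 'B'   # 2 to 5 minutes
        if gap < 10: return 'C'   # 5 to 10 minutes
        return 'D'                # anything longer

    text = [to_letter(g) for g in gaps]

    # First-order Markov model: count transitions from each letter to the next.
    transitions = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        transitions[cur][nxt] += 1

    # Most likely next gap class, given the last observed one.
    last = text[-1]
    prediction = transitions[last].most_common(1)[0][0] if transitions[last] else None
    print(prediction)   # 'C' for this toy data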
If you have an actual model that you think might be relevant for the problem domain, you should apply it. For example, it is likely that there are patterns related to day of week, time of day, and possibly date (holidays would presumably show lower usage).
Most raw statistical modelling techniques based on examining (say) time between adjacent events would have difficulty capturing these underlying influences.
I would build a statistical model for each of those known events (day of week, etc), and use that to predict future occurrences.
I think a predictive neural network would be a good approach for this task.
http://en.wikipedia.org/wiki/Predictive_analytics#Neural_networks
This method is also used for prediction tasks such as weather forecasting, the stock market, and sunspots.
There's a tutorial here if you want to know more about how it works.
http://www.obitko.com/tutorials/neural-network-prediction/
Think of a Markov chain as a graph with vertices connected to each other by a weight or distance. Moving around this graph would eat up the sum of the weights or distances you travel. Here is an example with text generation: http://phpir.com/text-generation.
A Kalman filter is used to track a state vector, generally with continuous (or at least discretized continuous) dynamics. This is sort of the polar opposite of sporadic, discrete events, so unless you have an underlying model that includes this kind of state vector (and is either linear or almost linear), you probably don't want a Kalman filter.
It sounds like you don't have an underlying model, and are fishing around for one: you've got a nail, and are going through the toolbox trying out files, screwdrivers, and tape measures 8^)
My best advice: first, use what you know about the problem to build the model; then figure out how to solve the problem, based on the model.

Rounding Standards - Financial Calculations

I am curious about the existence of any "rounding standards" when it comes to the calculation of financial data. My initial thought is to perform rounding only when the data is being presented to the user (presentation layer).
If "rounded" data is then used for further calculations, should we use the "rounded" figure or the "raw" figure? Does anyone have any advice?
Please note that I am aware of different rounding methods, i.e. Bankers Rounding etc.
The first and most important rule: use a decimal data type, never ever binary floating-point types.
When exactly rounding should be performed can be mandated by regulations, such as those governing the conversion between the Euro and the national currencies it replaced.
If there are no such rules, I'd do all calculations with high precision, and round only for presentation, i.e. not use rounded values for further calculations. This should yield the best overall precision.
I just asked a greybeard mainframe programmer at the financial software company I work for, and he said there is no well-known standard and it's up to programmer practice.
While statisticians have been aware of the rounding issue since at least 1906, it's difficult to find a financial standard endorsing it.
According to this site, the "European Commission report The Introduction of the Euro and the Rounding of Currency Amounts suggests that there had previously been no standard approach to rounding in banking."
In general, use a symmetric rounding mode no matter what base you are working in (base-2 or base-10).
This will avoid systematic bias during calculations.
Such a mode is Round-Half-To-Even, otherwise known as "bankers rounding".
Use language tools that allow you to specify the numeric context explicitly, including the rounding and truncation modes. For example, Python's decimal module. The implicit assumptions made by the C library might not be appropriate for your computations.
http://en.wikipedia.org/wiki/Rounding#Rounding_to_integer
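For instance, a minimal sketch with Python's decimal module, setting banker's rounding on the context explicitly:

    from decimal import Decimal, ROUND_HALF_EVEN, getcontext

    getcontext().rounding = ROUND_HALF_EVEN   # banker's rounding for this context

    # Binary floats accumulate representation error; Decimal does not.
    print(sum(0.10 for _ in range(10)))               # 0.9999999999999999
    print(sum(Decimal("0.10") for _ in range(10)))    # 1.00

    # Keep full precision internally; quantize to the currency's smallest unit
    # only at the presentation step.
    amount = Decimal("19.995") * Decimal("3")
    print(amount.quantize(Decimal("0.01")))           # 59.98 (half-even)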
It's frustrating that there aren't clear standards on this, both to guide the programmer, and as a defense in court. Just doing "regular" rounding toward nearest for payroll can lead to underpayment by a few pennies on a paycheck here and there, which is something labor lawyers eat up like crack.
Though a base pay rate may well only be specified in two decimal places ("You're hired at $22.71/hour"), things like blended overtime (determined by averaging multiple pay rates in a period) end up with an effective hourly rate of $23.37183475/hr.
How do you pay overtime on that?
15 hours x 23.37183475 x 1.5 = $525.87 rounded from $525.86628187
15 hours x 23.37 x 1.5 = $525.82
WHY DID YOU STEAL FIVE CENTS FROM MY CLIENT? Sadly, I'm not joking about this.
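For what it's worth, here is the same pair of calculations reproduced with Python's decimal module (banker's rounding assumed), which makes the five-cent gap easy to see:

    from decimal import Decimal, ROUND_HALF_EVEN

    hours, ot = Decimal("15"), Decimal("1.5")
    full_rate = Decimal("23.37183475")   # blended rate at full precision
    stub_rate = Decimal("23.37")         # rate as displayed on the pay stub
    cent = Decimal("0.01")

    from_full = (hours * full_rate * ot).quantize(cent, rounding=ROUND_HALF_EVEN)
    from_stub = (hours * stub_rate * ot).quantize(cent, rounding=ROUND_HALF_EVEN)
    print(from_full, from_stub, from_full - from_stub)   # 525.87 525.82 0.05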
This gets even more uncomfortable when you calculate at the full precision value but display a truncated version: you do the first calculation above, but only display $23.37 for the rate on the pay stub.
Now the pay stub calculations don't tie out to the penny, and now you have to explain it, but even if it's in the employee's favor, it can be enough for a labor lawyer to smell blood in the water and start looking for other stuff.
One approach is to always round in favor of the employee, not in the natural direction, so there cannot ever be an accusation of systematic wage theft.
I've not seen the existence of "the one standard to rule them all" - there are any number of rounding rules (as you have referenced), and they seem to come into play based on industry, customer, and currency code (http://en.wikipedia.org/wiki/ISO_4217) - since not everyone uses 2 places after the decimal, the problem becomes even more complicated. At the end of the day, your customer needs to specify the rules they want to implement...
Consider using scaled integers.
In other words, store whole numbers of pennies instead of fractional numbers of dollars.
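A tiny sketch of that idea in Python (the price and the 8.25% tax rate are made up):

    # Keep amounts as whole cents so ordinary integer arithmetic stays exact.
    price_cents = 1999                         # $19.99
    quantity = 3
    subtotal_cents = price_cents * quantity    # 5997, exactly

    # Apply an 8.25% tax, rounding to the nearest cent only at the end
    # (integer round-half-up: add half the divisor before dividing).
    tax_cents = (subtotal_cents * 825 + 5000) // 10000
    total_cents = subtotal_cents + tax_cents

    print(f"${total_cents // 100}.{total_cents % 100:02d}")   # $64.92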
