Calculate most common time of day from spreadsheet values - excel

Preliminary
This question applies to any spreadsheet system. I would like help in breaking down the problem, as opposed to an answer to the problem. (Although the latter would be most useful.)
I understand Stack Overflow is intended for specific programming problems, and that it may take me a few attempts to get my question right, so please help me clarify it by offering suggestions and I will update it.
Like many data novices I have good experience with discrete data (e.g. how many enquiries came in last month), but I struggle to understand how to deal with continuous data (e.g. how to discover patterns when the criteria for a query are not yet known).
The question
I have a spreadsheet where each row represents a "website enquiry". There is a datetime column, and I'd like to discover patterns in this data, to answer questions like:
what is the most common time of day to receive an enquiry
what is the most common day of the week to receive an enquiry
other useful information I can glean from the data, to allow me to target possible customers
This would be similar to the functions you often see in Social Media analytics, such as "best time to tweet".
I understand that calculating the most common day of the week is very simple, as days are discrete objects. So I don't need help with this!
I would like to avoid simply splitting the day into four arbitrary time periods (e.g. breakfast, lunch, dinner, night-time) and counting the number of rows that fall within each. What if those periods are not the best way to segment the data?
Is there another way, other than quantizing my data using arbitrary bounds?

You could use clustering to find out what the most common times are. Basically, you compare the time separation between enquiries and cluster them just like a discrete 1D set of numbers, using, for example, the average-linkage clustering criterion. Once you reach a reasonably small number of clusters, you will start to see the most dominant times of day (and if you want to evaluate those, you can take the time values at the weighted centres of the biggest clusters).
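For anyone who wants to try this outside the spreadsheet, here is a minimal sketch of the idea in Python using SciPy's average-linkage clustering. The sample times are invented, "minutes since midnight" is just one convenient 1D encoding, and asking for three clusters is arbitrary.

    # A sketch of the clustering suggestion, done in Python rather than in the
    # spreadsheet. Sample times are invented; the cluster count is arbitrary.
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    times = ["09:12", "09:45", "10:02", "13:30", "13:55", "14:10", "21:05"]
    minutes = np.array([[int(t[:2]) * 60 + int(t[3:])] for t in times], dtype=float)

    # Average-linkage hierarchical clustering on the 1D values.
    Z = linkage(minutes, method="average")
    labels = fcluster(Z, t=3, criterion="maxclust")

    # The centre of the biggest cluster is a candidate "most common time of day".
    for k in np.unique(labels):
        members = minutes[labels == k].ravel()
        centre = int(members.mean())
        print(f"cluster {k}: {len(members)} enquiries, centre {centre // 60:02d}:{centre % 60:02d}")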

Related

Use Excel to optimise KFC order

Don't laugh, but, from time to time, my friends and I host a multiple-course KFC dinner, and I have a spreadsheet to optimise the order. This is to make sure we order the right combination of 'bucket'-type items (i.e. SKUs that contain multiple pieces, often of different types):
to minimise leftover items
to reduce the total cost
Here is the spreadsheet I currently use, and here's a screenshot:
To use it, you first specify the number of participants in A2, and what you want each person to have in E2:E6 (we're really only interested in the chicken, so 'sides' are treated as generic to simplify).
Here's the manual part, that I'd like to improve.
The next step is to look at the ideal totals for each item (F2:F6), and to try to set the right quantities (H12:H20) of the 'bucket'-type SKUs that I have recreated (A11:G20), so that the output totals (H21:M21) match the ideal totals (F2:F6).
The optimisation part is to get the deltas (H22:M22) as close to zero as possible, and to get the total cost (N21) as low as possible.
So, my question is: is there a way to do this better? I think Excel has some sort of Solver functionality, but I'm afraid I don't know how I'd go about even starting to use that, as my Excel skills are pretty rudimentary. Oh, and in case it makes a difference regarding functionality, I'm using Excel for Mac v16.37.
Any thoughts gratefully appreciated! :)
I can't take any credit for this, but am happy to say that GSerg left me a couple of comments that pointed me in the right direction, and I now have Solver set up to organise my chicken parties!
Here are the parameters for anyone who is curious:
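For anyone curious how the same optimisation looks outside Excel, here is a rough sketch of it as a small integer program. The bucket contents, prices and targets below are invented placeholders rather than the real menu or the actual Solver parameters, and the PuLP library stands in for Solver.

    # A rough sketch of the same optimisation as an integer program (PuLP stands
    # in for Excel's Solver; bucket contents, prices and targets are placeholders).
    from pulp import LpInteger, LpMinimize, LpProblem, LpVariable, lpSum, value

    items = ["wings", "drumsticks", "fillets"]
    ideal = {"wings": 12, "drumsticks": 8, "fillets": 6}          # the F2:F6 equivalent

    # Each SKU: (price, {item: pieces per bucket})
    skus = {
        "BucketA": (15.99, {"wings": 6, "drumsticks": 2, "fillets": 0}),
        "BucketB": (21.49, {"wings": 4, "drumsticks": 4, "fillets": 2}),
        "FilletBox": (9.99, {"wings": 0, "drumsticks": 0, "fillets": 3}),
    }

    prob = LpProblem("kfc_order", LpMinimize)
    qty = {s: LpVariable(f"qty_{s}", lowBound=0, cat=LpInteger) for s in skus}   # the H12:H20 equivalent
    over = {i: LpVariable(f"over_{i}", lowBound=0) for i in items}               # leftover pieces (the deltas)

    # Totals must cover the ideal amounts; 'over' records any surplus.
    for i in items:
        prob += lpSum(qty[s] * skus[s][1][i] for s in skus) == ideal[i] + over[i]

    # Weighted objective: mostly minimise leftovers, with total cost as a tie-breaker.
    prob += 100 * lpSum(over.values()) + lpSum(qty[s] * skus[s][0] for s in skus)

    prob.solve()
    for s in skus:
        print(s, int(value(qty[s])))
    print("total cost:", value(lpSum(qty[s] * skus[s][0] for s in skus)))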

"IF" function for analysis of hospital lab frequency

I work for a hospital that is part of a larger network. We were recently asked by our corporate overlords to address the use of a specific laboratory test. In general, this test should only be performed daily, which should be taken to correspond to a 24-hour period from the last draw. Sometimes, however, based on when people arrive at the hospital (e.g. 7 pm), and in the interest of bundling labs into a single draw, it may be drawn sooner to coincide with routine testing (i.e. 5 am). It would otherwise never be necessary to repeat the test within a short (8-hour) window, particularly on the same day.
We have been asked to check whether we are adhering to this general practice, as testing any more frequently than that, say within 12 hours of a previous test, has no real clinical value and thus adds unnecessary cost.
To address this I was given a dataset that, among other items, includes every instance the lab was performed, including collection date and time.
Please see the HIPAA-safe example below (to be clear, there is no real data and the identifiers are not real); the actual dataset has over 4,174 entries corresponding to 1,328 unique persons. Everyone had at least one test performed, but not everyone had more than one.
I THINK what I want is an IF formula that reads the antecedent cell to (1) check whether it is the same person and (2) if so, subtract the timestamps to display the time difference, which I can then filter, turn into a histogram, etc. Does this seem like a reasonable approach? Is there a preferable method to facilitate the analysis? Do any other forms of analysis come to mind?
=IF(B2=B1, D2-D1, "n/a")
example data set with formula:
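For reference, the same antecedent-row comparison can be done outside Excel in a few lines of pandas; the file name and column names below are assumptions based on the description, not the real sheet.

    # The antecedent-row comparison in pandas (file and column names are assumed).
    import pandas as pd

    df = pd.read_excel("labs.xlsx")          # assumed columns: PatientID, CollectionDateTime
    df = df.sort_values(["PatientID", "CollectionDateTime"])

    # Hours since that patient's previous draw; NaN for each patient's first test.
    df["HoursSincePrev"] = (
        df.groupby("PatientID")["CollectionDateTime"].diff().dt.total_seconds() / 3600
    )

    # Flag repeats inside the 24-hour (or 12-hour) window for review or a histogram.
    print((df["HoursSincePrev"] < 24).sum(), "tests repeated within 24 h")
    df["HoursSincePrev"].dropna().hist(bins=48)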
By the looks of it you should consider taking the values under "Results" into account, assuming there is a band that might be considered 'normal' readings. The "one in 24 hours is sufficient" rule of thumb may well be appropriate for a series of values within the 'normal' band, but less so if readings are close to a 'danger level'.
That is, in some cases a higher-than-standard frequency of monitoring may be in the patient's interest, even if it is not hospital policy. So it may be worth separating the "less than 24 hours interval" readings into those where the higher frequency provided information of little value (e.g. readings remaining within the 'normal' band) and those that crossed into or out of the band or showed large changes in value. This, though, may be more a matter of statistical analysis than programming, and will depend on whether any action might be taken as a result of such "extra" readings.

How Can I Model Many Short Time Series Samples?

For example, let's say I have a new subject each month, and I measure each subject every day for the entire month. I then want to model these multiple strings of independent time series because I assume that there is an underlying pattern that applies to all 12 subjects. However, a time series with an n of 30 is too short to model, so is there some way to group these 12 time series together for a parallel analysis?
I imagine the way to handle this is similar to how one might handle a time series with multiple breaks of unknown length. Unfortunately, I am unaware of how to deal with this type of data structure.
Any thoughts on where to even begin? What terms I should research?
Well, it depends on what you're interested in. It makes it a lot easier if we know what kind of data you have and what you're trying to analyse.
Trying to answer your question: if you assume that there is some underlying structure which is homogeneous for, say, 6 of the subjects and different for the other half, you can just pool the two data sets and do some kind of group-mean analysis. If you're interested in a temporal change over the 12 months, then you need to assume that each subject is homogeneous across whatever variable you're measuring.
Normally, for time series in economics for example, what you're describing is called "censored" or "truncated" data.
If we want to measure the income of everyone in a country, we do this by checking electronic paychecks or something similar. But some people at the end of each tail may not have a visible income: poor people may be earning income in other ways, and rich people may want to hide some of theirs. This is censored data, and any advanced time-series stats book will have something on that.
Truncated data is similar. Imagine income again: if we truncate everyone who makes less than $10,000 a year, this will "cut off the end" of your distribution. There are also remedies for this; again, check an advanced time-series book.
Hope this helped a bit.
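As a concrete illustration of the pooling / group-mean suggestion above, here is a minimal sketch with simulated data; the sinusoidal "underlying pattern" and the noise levels are purely assumptions for the demonstration.

    # Simulated illustration of pooling 12 short series and taking group means.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    days = np.arange(1, 31)

    records = []
    for subject in range(12):
        pattern = 10 + 2 * np.sin(2 * np.pi * days / 30)               # assumed shared pattern
        noise = rng.normal(0, 1, size=days.size) + rng.normal(0, 0.5)  # daily + subject-level noise
        records.append(pd.DataFrame({"subject": subject, "day": days, "y": pattern + noise}))

    long = pd.concat(records, ignore_index=True)

    # Pooling across subjects gives 12 observations per day instead of one short
    # series per subject; the day-by-day group mean estimates the shared pattern.
    group_mean = long.groupby("day")["y"].agg(["mean", "sem"])
    print(group_mean.head())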

How can I analyze Google Voice data dump?

I use GV for business and have since about 2011. Over that time I've amassed about 10,000 calls with various clients. I'd like to analyze this data to understand things like what days of the week did I have the most calls, what months had the highest call volume, what hour of the day has the highest call volume, et cetera. (Eventually I would also like to compare that to my Google Calendar data to analyze my conversion rates for a given month, but that's step 2)
My question is, is there any easy way to do this short of actually learning to use Excel? Are there any free or relatively cheap statistics programs that will cut some of the work out for me? It's easy enough to clean the data and drop it into Excel, but there are so many intermediary steps between having a good clean data set and actually getting a histogram out of it that it's starting to feel like it isn't worth it.
I have a list of about 10k calls in this format:
Col. A    Col. B    Col. C
client    date      24-hr time
I'm not particularly concerned with who the client is... I just want to analyze the last two columns.
Any help at all would be greatly appreciated.
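For what it's worth, if a few lines of Python are acceptable instead of Excel, pandas produces those counts almost directly; the file name and column names below are assumptions based on the three-column format described.

    # Counts by weekday, month and hour of day from a three-column call log
    # (file name and column names are assumptions).
    import pandas as pd

    calls = pd.read_csv("gv_calls.csv", names=["client", "date", "time"])
    calls["when"] = pd.to_datetime(calls["date"] + " " + calls["time"])

    print(calls["when"].dt.day_name().value_counts())          # busiest days of the week
    print(calls["when"].dt.month_name().value_counts())        # busiest months
    print(calls["when"].dt.hour.value_counts().sort_index())   # call volume by hour of day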

Statistically removing erroneous values

We have an application where users enter prices all day. These prices are recorded in a table with a timestamp and then used to produce charts of how the price has moved... Every now and then a user enters a price incorrectly (e.g. puts in one zero too many or too few), which somewhat ruins the chart (you get big spikes). We've even added an extra confirmation dialogue if the price moves by more than 20%, but this doesn't stop them entering wrong values...
What statistical method can I use to analyse the values before I chart them to exclude any values that are way different from the rest?
EDIT: To add some meat to the bone. Say the prices are share prices (they are not but they behave in the same way). You could see prices moving significantly up or down during the day. On an average day we record about 150 prices and sometimes one or two are way wrong. Other times they are all good...
Calculate and track the standard deviation for a while. After you have a decent backlog, you can disregard the outliers by seeing how many standard deviations away they are from the mean. Even better, if you've got the time, you could use the info to do some naive Bayesian classification.
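A minimal sketch of that idea in plain Python, with an arbitrary threshold of two standard deviations and a made-up price series:

    # Flag entries more than k standard deviations from the mean of the recorded
    # prices; k = 2 is arbitrary and the sample series is made up.
    from statistics import mean, stdev

    def without_outliers(prices, k=2.0):
        m, s = mean(prices), stdev(prices)
        if s == 0:
            return list(prices)
        return [p for p in prices if abs(p - m) <= k * s]

    day = [101.2, 101.5, 101.4, 101.1, 1015.0, 101.3, 101.6]   # one "extra zero" typo
    print(without_outliers(day))                               # the 1015.0 entry is dropped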
That's a great question, but it may lead to quite a bit of discussion as the answers could be very varied. It depends on:
how much effort are you willing to put into this?
could some entries genuinely differ by +/-20%, or fail whatever test you invent? If so, will there always be a need for some human intervention?
and to invent a relevant test I'd need to know far more about the subject matter.
That being said, the following are possible alternatives.
A simple test against the previous value (or the mean/mode of the previous 10 or 20 values) would be straightforward to implement.
The next level of complexity would involve some statistical measurement of all values (or the previous x values, or the values from the last 3 months). A normal (Gaussian) distribution would enable you to give each value a degree of certainty as to whether it is a mistake or accurate; this degree of certainty would typically be expressed as a percentage.
See http://en.wikipedia.org/wiki/Normal_distribution and http://en.wikipedia.org/wiki/Gaussian_function; there are adequate links from these pages to help in programming this, and depending on the language you're using there are likely to be functions and/or plugins available to help as well.
A more advanced method would be some sort of learning algorithm that takes other parameters into account on top of the last x values: the product type or manufacturer, for instance, or even the time of day or the user who entered the figure. This option seems way over the top for what you need, however; it would require a lot of work to code and to train the learning algorithm.
I think the second option is the right one for you. Using the standard deviation (a lot of languages contain a function for this) may be a simpler alternative: it is simply a measure of how far a value has deviated from the mean of the previous x values. I'd put the standard-deviation option somewhere between options 1 and 2.
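A sketch of the "previous x values" variant in pandas; the window size and threshold are arbitrary and would need tuning against real data.

    # Test each new price against the mean/std of the previous values only.
    import pandas as pd

    prices = pd.Series([101.2, 101.5, 101.4, 101.1, 1015.0, 101.3, 101.6, 101.8])

    rolling = prices.shift(1).rolling(window=20, min_periods=3)   # past values only
    z = (prices - rolling.mean()) / rolling.std()

    suspect = z.abs() > 4            # candidate typos, for confirmation or exclusion
    print(prices[suspect])           # flags the 1015.0 entry in this made-up series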
You could measure the standard deviation in your existing population and exclude those that are greater than 1 or 2 standard deviations from the mean?
It's going to depend on what your data looks like to give a more precise answer...
Or graph a moving average of prices instead of the actual prices.
Quoting from here:
Statisticians have devised several methods for detecting outliers. All the methods first quantify how far the outlier is from the other values. This can be the difference between the outlier and the mean of all points, the difference between the outlier and the mean of the remaining values, or the difference between the outlier and the next closest value. Next, standardize this value by dividing by some measure of scatter, such as the SD of all values, the SD of the remaining values, or the range of the data. Finally, compute a P value answering this question: If all the values were really sampled from a Gaussian population, what is the chance of randomly obtaining an outlier so far from the other values? If the P value is small, you conclude that the deviation of the outlier from the other values is statistically significant.
Google is your friend, you know. ;)
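The quoted procedure is essentially Grubbs' test; here is a minimal sketch assuming roughly Gaussian prices (which may or may not hold for this data) and an arbitrary significance level.

    # Essentially Grubbs' test, following the quoted recipe: distance of the most
    # extreme value from the mean, scaled by the SD, compared with a critical value
    # under a Gaussian assumption. alpha = 0.05 is arbitrary.
    import numpy as np
    from scipy import stats

    def grubbs_outlier(values, alpha=0.05):
        x = np.asarray(values, dtype=float)
        n = len(x)
        g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)     # test statistic
        t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
        return g > g_crit, x[np.argmax(np.abs(x - x.mean()))]

    flagged, value = grubbs_outlier([101.2, 101.5, 101.4, 101.1, 1015.0, 101.3])
    print(flagged, value)            # True, 1015.0 for this made-up sample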
For your specific question of plotting, and your specific scenario of an average of 1-2 errors per day out of 150, the simplest thing might be to plot trimmed means, or the range of the middle 95% of values, or something like that. It really depends on what value you want out of the plot.
If you are really concerned with the true max and true min of a day's prices, then you have to treat the outliers as outliers and properly exclude them, probably using one of the outlier tests previously proposed (a data point is x% more than the next point, or than the last n points, or more than 5 standard deviations away from the daily mean). Another approach is to look at what happens after the suspect point: if it is an outlier, it will show a sharp upturn followed by a sharp downturn.
If, however, you care about the overall trend, plotting the daily trimmed mean, median, and 5th and 95th percentiles will portray the history well.
Choose your display methods, and how much outlier detection you need to do, based on the analysis question. If you care about medians or percentiles, the outliers are probably irrelevant.
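A short sketch of the trimmed-mean / percentile summary in pandas; the file name and column names are assumptions.

    # Daily trimmed mean, median and 5th/95th percentiles; a couple of stray
    # entries barely move these lines (file and column names are assumed).
    import pandas as pd
    from scipy import stats

    prices = pd.read_csv("prices.csv", parse_dates=["timestamp"])
    daily = prices.groupby(prices["timestamp"].dt.date)["price"]

    summary = daily.agg(
        trimmed_mean=lambda s: stats.trim_mean(s, 0.05),   # drop the extreme 5% each side
        median="median",
        p05=lambda s: s.quantile(0.05),
        p95=lambda s: s.quantile(0.95),
    )
    summary.plot()    # e.g. with matplotlib; spikes from stray entries mostly disappear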

Resources