Adding weight ot variables in a line graph Tableau - statistics

I have a dataset consisting of calls going to agents (atually 10 of them) per day. These agents can either answer calls or transfer them to a call center. What we are interested in is whether each of these agents answers more calls than he transfers. In order to answer this, I have created a variable for each of these agents:
Answered/Transferred
I am using line graph to depict these variables per agent over time.
Now if this variable is less than 1 then this agent transferred more calls than he received. The problem now is that this is not a safe way to measure the overall impact of transferred calls. This is because the traffic pertaining to agents 1,2,3 is far greater than the one pertaining to agents 5,6,7 and so on. Therefore, I am trying to come up with a way to "weight" the variables I have created before. That is, somehow include the total number of calls reaching each agent (irrespectively of whether they are getting transferred or answered) in my calculations. That means that if an agent is getting 5 calls per day while another guy is getting 5.000 per day then I should find a way to depict this in my graphs.
Do you guys have any ideas?

Easiest would be to drag weight measure to colors and choose something like temperature diverging. Depending on your viz you can also drag weight measure to size, and for example, make bars or lines thicker to show there are more records there.

Related

Optimization of resources in Excel

I'm struggling with a task of optimization of resources, and I wonder if someone knows any efficient way to find the optimal solution for it, using only Excel. I explain what I'm trying to achieve:
Suppose you are managing an assembly factory for metal tubes. Your raw material is standard size tubes, and then in your factory you need to cut these tubes according to a list of requests from clients, with very specific sizes. All tubes are of the same type, so we can reuse leftovers from each cut, if the length of that leftover is sufficient to satisfy any tube request.
We can also group small length requests to be made from one single tube, for example, on the attached list, we could use one 8 metre tube to deliver the last four entries (1,615+1,62+1,625+1,67), with 1,47 leftover wasted.
Assuming a long list of requests, and that the tubes supplied are 8 metres each, do you know of any way of calculating how many tubes I have to order to satisfy the list of requests, minimising the losses per each cut?
Example of request list, each entry is in metres

How to parallelize execution of a custom function formula while keeping the Google Sheet shareable and permissionless?

I have a Google Sheet with a custom function formula that: takes in a matrix and two vectors from a spreadsheet, does some lengthy matrix-vector calculations (>30 sec, so above the quota), before outputting the result as a bunch of rows. It is single-threaded, since that's what Google Apps Script (GAS) natively is, but I want to parallelize the calculation using a multi-threading workaround, so it speeds it up drastically.
Requirements (1-3):
UX: It should run the calculations automatically and reactively as a custom function formula, which implies that the user doesn't have to manually start it by clicking a run button or similar. Like my single-threaded version currently does.
Parallelizable: It should ideally spawn ~30 threads/processes, so that instead of taking >30 seconds as it now does (which makes it time out due to Google's quota limit), it should take ~1 second. (I know GAS is single-threaded, but there are workarounds, referenced below).
Shareability: I should ideally be able to share the Sheet with other people, so they can "Make a copy" of it, and the script will still run the calculations for them:
3.1 Permissionless: Without me having to manually hand out individual permissions to users (permissionless). For instance whenever someone "Makes a copy" and "Execute the app as user accessing the web app". My rudimentary testing suggest that this is possible.
3.2 Non-intrusive: Without users of the spreadsheet having to give intrusive authorizations like "Give this spreadsheet/script/app access to your entire Google Drive or Gmail account?". Users having to give an non-intrusive authorization to a script/webapp can be acceptable, as long as requirement 3.1 is still maintained.
3.3 UX: Without forcing users to view a HTML sidebar in the spreadsheet.
I have already read this excellent related answer by #TheMaster which outlines some potential ways of solving parallelization in Google Apps script in general. Workaround #3 google.script.run and workaround #4 UrlFetchApp.fetchAll (both using a Google Web App) looks most promising. But some details are unknown to me, such as if they can adhere to requirements 1 and 3 with its sub-requirements.
I can conceive of an other potential naïve workaround which would be to split the function up into several custom functions formulas and do the parallelization (by some kind of Map/Reduce) inside the spreadsheet itself (storing intermediary results back into the spreadsheet, and having custom function formulas work on that as reducers). But that's undesired, and probably unfeasible, in my case.
I'm very confident my function is parallelizable using some kind of Map/Reduce process. The function is currently optimized by doing all the calculations in-memory, without touching the spreadsheet in-between steps, before finally outputting the result to the spreadsheet. The details of it is quite intricate and well over 100 lines, so I don't want to overload you with more (and potentially confusing) information which doesn't really affect the general applicability of this case. For the context of this question you may assume that my function is parallelizable (and map-reduce'able), or consider any function you already know that would be. What's interesting is what's generally possible to achieve with parallelizationin Google Apps Script, while also maintaining the highest level of shareability and UX. I'll update this question with more details if needed.
Update 2020-06-19:
To be more clear, I do not rule out Google Web App workarounds entirely, as I haven't got experience with their practical limitations to know for sure if they can solve the problem within the requirements. I have updated the sub-requirements 3.1 and 3.2 to reflect this. I also added sub-req 3.3, to be clearer on the intent. I also removed req 4, since it was largely overlapping with req 1.
I also edited the question and removed the related sub-questions, so it is more focused on the single main HOWTO-question in the title. The requirements in my question should provide a clear objective standard for which answers would be considered best.
I realise the question might entail a search for the Holy Grail of Google Sheet multithreading workarounds, as #TheMaster has pointed out in private. Ideally, Google would provide one or more features to support multithreading, map-reduce, or more permissionless sharing. But until then I would really like to know what is the optimal workaround within the current constraints we have. I would hope this question is relevant to others as well, even considering the tight requirements.
If you publish a web-app with "anyone, even anonymous", execute as "Me", then the custom function can use UrlFetchApp.fetchAllAuthorization not needed to post to that web-app. This will run in parallelproof. This solves all the three requirements.
Caveat here is: If multiple people use the sheet, and the custom function will have to post to the "same" webapp (that you published to execute as you) for processing, Google will limit simultaneous executionsquota limit:30.
To workaround this, You can ask people using your sheet to publish their own web-apps. They'll have to do this once at the beginning and no authorization is needed.
If not, you'll need to host a custom server for the load or something like google-cloud-functions might help
I ended up using the naïve workaround that I mentioned in my post:
I can conceive of an other potential naïve workaround which would be
to split the function up into several custom functions formulas and do
the parallelization (by some kind of Map/Reduce) inside the
spreadsheet itself (storing intermediary results back into the
spreadsheet, and having custom function formulas work on that as
reducers). But that's undesired, and probably unfeasible, in my case.
I initially disregarded it because it involves having an extra sheet tab with calculations which was not ideal. But when I reflected on it after investigating alternative solutions, it actually solves all the stated requirements in the most non-intrusive manner. Since it doesn't require anything extra from users the spreadsheet is shared with. It also stays 'within' Google Sheets as far as possible (no semi- or fully external Web App needed), doing the parallelization by relying on the native parallelization of concurrently executing spreadsheet cells, where results can be chained, and appear to the user like using regular formulas (no extra menu item or run-this-script-buttons necessary).
So I implemented MapReduce in Google Sheets using custom functions each operating on a slice of the interval I wanted to calculate. The reason I was able to do that, in my case, was that the input to my calculation was divisible into intervals that could each be calculated separately, and then joined later.**
Each parallel custom function then takes in one interval, calculates that, and outputs the results back to the sheet (I recommend to output as rows instead of columns, since columns are capped at 18 278 columns max. See this excellent post on Google Spreadsheet limitations.) I did run into the only 40,000 new rows at a time limitation, but was able to perform some reducing on each interval, so that they only output a very limited amount of rows to the spreadsheet. That was the parallelization; the Map part of MapReduce. Then I had a separate custom function which did the Reduce part, namely: dynamically target*** the spreadsheet output area of the separately calculated custom functions, and take in their results, once available, and join them together while further reducing them (to find the best performing results), to return the final result.
The interesting part was that I thought I would hit the only 30 simultaneous execution quota limit of Google Sheets. But I was able to parallelize up to 64 independently and seemingly concurrently executing custom functions. It may be that Google puts these into a queue if they exceed 30 concurrent executions, and only actually process 30 of them at any given time (please comment if you know). But anyhow, the parallelization benefit/speedup was huge, and seemingly nearly infinitely scalable. But with some caveats:
You have to define the number of parallelised custom functions up front, manually. So the parallelization doesn't infinitely auto-scale according to demand****. This is important because of the counter-intuitive result that in some cases using less parallelization actually executes faster. In my case, the result set from a very small interval could be exceedingly large, while if the interval had been larger then a lot of the results would have been ruled out underway in the algorithm in that parallelised custom function (i.e. the Map also did some reduction).
In rare cases (with huge inputs), the Reducer function will output a result before all of the parallel (Map) functions have completed (since some of them seemingly take too long). So you seemingly have a complete result set, but then a few seconds later it will re-update when the last parallel function returns its result. This is not ideal, so to be notified of this I implemented a function to tell me if the result was valid. I put it in the cell above the Reduce function (and colored the text red). B6 is the number of intervals (here 4), and the other cell references go to the cell with the custom function for each interval: =didAnyExecutedIntervalFail($B$6,S13,AB13,AK13,AT13)
function didAnyExecutedIntervalFail(intervalsExecuted, ...intervalOutputs) {
const errorValues = new Set(["#NULL!", "#DIV/0!", "#VALUE!", "#REF!", "#NAME?", "#NUM!", "#N/A","#ERROR!", "#"]);
// We go through only the outputs for intervals which were included in the parallel execution.
for(let i=0; i < intervalsExecuted; i++) {
if (errorValues.has(intervalOutputs[i]))
return "Result below is not valid (due to errors in one or more of the intervals), even though it looks like a proper result!";
}
}
The parallel custom functions are limited by Google quota of max 30 sec execution time for any custom function. So if they take too long to calculate, they still might time out (causing the issue mentioned in the previous point). The way to alleviate this timeout is to parallelise more, dividing into more intervals, so that each parallel custom function runs below 30 second.
The output of it all is limited by Google Sheet limitations. Specifically max 5M cells in a spreadsheet. So you may need to perform some reduction on the size of the results calculated in each parallel custom function, before returning its result to the spreadsheet. So that they each are below 40 000 rows, otherwise you'll receive the dreaded "Results too large" error). Furthermore, depending on the size the result of each parallel custom function, it would also limit how many custom functions you could have at the same time, as they and their result cells take space in the spreadsheet. But if each of them take in total, say 50 cells (including a very small output), then you could still parallelize pretty much (5M / 50 = 100 000 parallel functions) within a single sheet. But you also need some space for whatever you want to do with those results. And the 5M cells limit is for the whole Spreadsheet in total, not just for one of its sheet tabs, apparently.
** For those interested: I basically wanted to calculate all combinations of a sequence of bits (by brute force), so the function was 2^n where n was the number of bits. The initial range of combinations was from 1 to 2^n, so it could be divided into intervals of combinations, for example, if dividing into two intervals, it would be one from 1 to X and then one from X+1 to 2^n.
*** For those interested: I used a separate sheet formula to dynamically determine the range for the output of one of the intervals, based on the presence of rows with content. It was in a separate cell for each interval. For the first interval it was in cell S11 and the formula looked like this:
=ADDRESS(ROW(S13),COLUMN(S13),4)&":"&ADDRESS(COUNTA(S13:S)+ROWS(S1:S12),COLUMN(Z13),4) and it would output S13:Z15 which is the dynamically calculated output range, which only counts those rows with content (using COUNTA(S13:S)), thus avoiding to have a statically determined range. Since with a normal static range the size of the output would have to be known in advance, which it wasn't, or it would possibly either not include all of the output, or a lot of empty rows (and you don't want the Reducer to iterate over a lot of essentially empty data structures). Then I would input that range into the Reduce function by using INDIRECT(S$11). So that's how you get the results, from one of the intervals processed by a parallelized custom function, into the main Reducer function.
**** Though you could make it auto-scale up to some pre-defined amount of parallelised custom functions. You could use some preconfigured thresholds, and divide into, say, 16 intervals in some cases, but in other cases automatically divide into 64 intervals (preconfigured, based on experience). You'd then just stop / short-circuit the custom functions which shouldn't participate, based on if the number of that parallelised custom function exceeds the number of intervals you want to divide into and process. On the first line in the parallelised custom function: if (calcIntervalNr > intervals) return;. Though you would have to set up all the parallel custom functions in advance, which can be tedious (remember you have to account for the output area of each, and are limited by the max cell limit of 5M cells in Google Sheets).

Plot a Line Graph

There is a scenario where I have to create a line graph in excel or anywhere which represents two approaches. The parameters involved are:
The number of systems on which both old and new approach is tested.
The number of calls made on each system.
Amount of time taken for each call.
Can you please help me how do I represent this in tabular format so that I can create a line graph with one axis as time taken and other one as System/Calls made:
System1
============================
Call 1:
--------------
Old Approach:
That took 15.658 seconds.
New Approach:
That took 8.076 seconds.
Call 2:
--------------
Old Approach:
That took 10.054 seconds.
New Approach:
That took 5.65 seconds.
System2
============================
Call 1:
--------------
Old Approach:
That took 13.775 seconds.
New Approach:
That took 4.542 seconds.
Call 2:
--------------
Old Approach:
That took 10.097 seconds.
New Approach:
That took 4.0200000000000005 seconds.
I think you actually want a bar chart if you want to see the time difference between old and new approaches under each system (using the axes layout you describe).
I will leave images of both so you can see what i mean. From a data layout point of you, it is often best to create a flat table where each column is a distinct attribute and of the same datatype. I.e. there are not mixed datatypes or different measures within the same column. I would recommend reading parts of chapter 2, as a minimum, of Hadley Wickam's paper on Tidy Data. You don't need to understand the R language to get a lot out of chapter 2.
If you have multiple repeated measurements you could plot the average rather than the sum, or even better would be the median as less prone to skew i.e. less biased by a very long or short call present in the data set.

How to predict when next event occurs based on previous events? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 11 years ago.
Improve this question
Basically, I have a reasonably large list (a year's worth of data) of times that a single discrete event occurred (for my current project, a list of times that someone printed something). Based on this list, I would like to construct a statistical model of some sort that will predict the most likely time for the next event (the next print job) given all of the previous event times.
I've already read this, but the responses don't exactly help out with what I have in mind for my project. I did some additional research and found that a Hidden Markov Model would likely allow me to do so accurately, but I can't find a link on how to generate a Hidden Markov Model using just a list of times. I also found that using a Kalman filter on the list may be useful but basically, I'd like to get some more information about it from someone who's actually used them and knows their limitations and requirements before just trying something and hoping it works.
Thanks a bunch!
EDIT: So by Amit's suggestion in the comments, I also posted this to the Statistics StackExchange, CrossValidated. If you do know what I should do, please post either here or there
I'll admit it, I'm not a statistics kind of guy. But I've run into these kind of problems before. Really what we're talking about here is that you have some observed, discrete events and you want to figure out how likely it is you'll see them occur at any given point in time. The issue you've got is that you want to take discrete data and make continuous data out of it.
The term that comes to mind is density estimation. Specifically kernel density estimation. You can get some of the effects of kernel density estimation by simple binning (e.g. count the number events in a time interval such as every quarter hour or hour.) Kernel density estimation just has some nicer statistical properties than simple binning. (The produced data is often 'smoother'.)
That only takes care of one of your problems, though. The next problem is still the far more interesting one -- how do you take a time line of data (in this case, only printer data) and produced a prediction from it? First thing's first -- the way you've set up the problem may not be what you're looking for. While the miracle idea of having a limited source of data and predicting the next step of that source sounds attractive, it's far more practical to integrate more data sources to create an actual prediction. (e.g. maybe the printers get hit hard just after there's a lot of phone activity -- something that can be very hard to predict in some companies) The Netflix Challenge is a rather potent example of this point.
Of course, the problem with more data sources is that there's extra legwork to set up the systems that collect the data then.
Honestly, I'd consider this a domain-specific problem and take two approaches: Find time-independent patterns, and find time-dependent patterns.
An example time-dependent pattern would be that every week day at 4:30 Suzy prints out her end of the day report. This happens at specific times every day of the week. This kind of thing is easy to detect with fixed intervals. (Every day, every week day, every weekend day, every Tuesday, every 1st of the month, etc...) This is extremely simple to detect with predetermined intervals -- just create a curve of the estimated probability density function that's one week long and go back in time and average the curves (possibly a weighted average via a windowing function for better predictions).
If you want to get more sophisticated, find a way to automate the detection of such intervals. (Likely the data wouldn't be so overwhelming that you could just brute force this.)
An example time-independent pattern is that every time Mike in accounting prints out an invoice list sheet, he goes over to Johnathan who prints out a rather large batch of complete invoice reports a few hours later. This kind of thing is harder to detect because it's more free form. I recommend looking at various intervals of time (e.g. 30 seconds, 40 seconds, 50 seconds, 1 minute, 1.2 minutes, 1.5 minutes, 1.7 minutes, 2 minutes, 3 minutes, .... 1 hour, 2 hours, 3 hours, ....) and subsampling them via in a nice way (e.g. Lanczos resampling) to create a vector. Then use a vector-quantization style algorithm to categorize the "interesting" patterns. You'll need to think carefully about how you'll deal with certainty of the categories, though -- if your a resulting category has very little data in it, it probably isn't reliable. (Some vector quantization algorithms are better at this than others.)
Then, to create a prediction as to the likelihood of printing something in the future, look up the most recent activity intervals (30 seconds, 40 seconds, 50 seconds, 1 minute, and all the other intervals) via vector quantization and weight the outcomes based on their certainty to create a weighted average of predictions.
You'll want to find a good way to measure certainty of the time-dependent and time-independent outputs to create a final estimate.
This sort of thing is typical of predictive data compression schemes. I recommend you take a look at PAQ since it's got a lot of the concepts I've gone over here and can provide some very interesting insight. The source code is even available along with excellent documentation on the algorithms used.
You may want to take an entirely different approach from vector quantization and discretize the data and use something more like a PPM scheme. It can be very much simpler to implement and still effective.
I don't know what the time frame or scope of this project is, but this sort of thing can always be taken to the N-th degree. If it's got a deadline, I'd like to emphasize that you worry about getting something working first, and then make it work well. Something not optimal is better than nothing.
This kind of project is cool. This kind of project can get you a job if you wrap it up right. I'd recommend you do take your time, do it right, and post it up as function, open source, useful software. I highly recommend open source since you'll want to make a community that can contribute data source providers in more environments that you have access to, will to support, or time to support.
Best of luck!
I really don't see how a Markov model would be useful here. Markov models are typically employed when the event you're predicting is dependent on previous events. The canonical example, of course, is text, where a good Markov model can do a surprisingly good job of guessing what the next character or word will be.
But is there a pattern to when a user might print the next thing? That is, do you see a regular pattern of time between jobs? If so, then a Markov model will work. If not, then the Markov model will be a random guess.
In how to model it, think of the different time periods between jobs as letters in an alphabet. In fact, you could assign each time period a letter, something like:
A - 1 to 2 minutes
B - 2 to 5 minutes
C - 5 to 10 minutes
etc.
Then, go through the data and assign a letter to each time period between print jobs. When you're done, you have a text representation of your data, and that you can run through any of the Markov examples that do text prediction.
If you have an actual model that you think might be relevant for the problem domain, you should apply it. For example, it is likely that there are patterns related to day of week, time of day, and possibly date (holidays would presumably show lower usage).
Most raw statistical modelling techniques based on examining (say) time between adjacent events would have difficulty capturing these underlying influences.
I would build a statistical model for each of those known events (day of week, etc), and use that to predict future occurrences.
I think the predictive neural network would be a good approach for this task.
http://en.wikipedia.org/wiki/Predictive_analytics#Neural_networks
This method is also used for predicting f.x. weather forecasting, stock marked, sun spots.
There's a tutorial here if you want to know more about how it works.
http://www.obitko.com/tutorials/neural-network-prediction/
Think of a markov chain like a graph with vertex connect to each other with a weight or distance. Moving around this graph would eat up the sum of the weights or distance you travel. Here is an example with text generation: http://phpir.com/text-generation.
A Kalman filter is used to track a state vector, generally with continuous (or at least discretized continuous) dynamics. This is sort of the polar opposite of sporadic, discrete events, so unless you have an underlying model that includes this kind of state vector (and is either linear or almost linear), you probably don't want a Kalman filter.
It sounds like you don't have an underlying model, and are fishing around for one: you've got a nail, and are going through the toolbox trying out files, screwdrivers, and tape measures 8^)
My best advice: first, use what you know about the problem to build the model; then figure out how to solve the problem, based on the model.

Find UK PostCodes closest to other UK Post Codes by matching the Post Code String

Here is a question that has me awake for a number of days now. The only conclusion I came up so far is that Red Bull does not usually help coders.
I have a scenario in my application where I have a couple of jobs (1 to 50). The job has an address and I have the following properties of an address: Postcode, Latitude, and Longitude.
I have a table of workers also and they too have addresses. While the jobs or workers are created through screens, I use Google Map queries to make sure the provided Postcode is valid and is in UK so all the addresses are verified.
I am using a scheduler control to display some workers on y-axis and a timeline on x-axis. Every job has a date and can only move vertically on the scheduler on the job’s date. The user selects a number of jobs and they are displayed in a basket close to the scheduler. The user can then drag and drop job against workers. All this is manual so it works.
My task is to automate this so that the user does not do much except just verifying and allotting the jobs. Therefore, I have to automate the process.
Every worker has a property called WillingMaximumDistanceTravel which is an integer representing miles, the worker is willing to travel for a job.
Now here is the headache: I have over 1500 workers. I have a utility function that uses Newtonsoft’s Json Convert to de-serialize a stream of response from Google Maps. I need to feed it Postcode A and B.
I also plan to introduce a new table to DB to store the distance finds as Postcode A, Postcode B, and Distance. Therefore, if I find myself comparing the same postcodes again, I will just retrieve the result from DB instead and slowly and eventually, I would no longer require bothering Google anymore as this table would be very comprehensive.
I cannot use the simple Haversine formula, as Crow-fly path is not my requirement here. The pain in this is that it takes a lot of time to calculate. Some workers can travel over 10 miles while some vary from 15 to 80. I have to take the first job from the list and run it with every applicable worker o the system! I was wondering that the UK postcode has a pattern to it. If we sort a list of UK postcodes, can we rough-estimate, from the alphanumeric pattern, where will we hit a 100-mile mark, a 200-mile mark and so on?
If anyone is interested in the code, please drop a line and I will paste it.
(I work for Google, but I'm not speaking on behalf of Google. I have nothing to do with the maps API.)
I suspect this isn't a great situation for using the Google Maps API, simply because you're pushing so much data through. You really don't want to make that many requests, even if you could do so under the directions limits.
When I tackled something similar in a previous job, we bought into a locally-hosted maps API - but even that wasn't fast enough for this sort of work. We ended up precomputing the time to travel from the centroid of each postcode "area" (probably the wrong name for it, but the first part of the postcode followed by the first digit of the remainder, e.g. "SW1W 9" for "SW1W 9TQ") to every other area, storing the result in a giant table. I think we only did it for postcodes which were within 100 miles or something similar, to cut down on the amount of preprocessing.
Even then, a simple DB wasn't quite as fast as we wanted - so we stored the results in a giant file, with a single byte per source/destination pair. (We had a fixed sequence of source postcodes and target postcodes, so we didn't need to specify those.) At that point, computing a travel time consisted of:
Work out postcode areas (substring work)
Find the index of each postcode area within the sequence
Check if we'd loaded that part of the file (we lazy loaded for startup speed)
Load the row if necessary, and just access it otherwise
The bytes were on a sliding scale of accuracy, so for the first 60 minutes it was on a per-minute basis, then each extra value meant an extra 2 minutes, then 5 etc. (Those aren't the exact values, but it was something like that.)
When you've worked out "good candidates" you can ask an on-site API or the Google Maps API for more accurate directions for your exact postcodes, of course.
You want to look for a spatial-index or a space-filling-curve. A spatial index reduce the 2d problem to a 1d problem and recursivley subdivide the surface into smaller tiles but it is basically a reordering of the tiles. You can subdivide the surface either with an index or a string using 4 characters. The latter one can be useful to you because it let you query the string with all string operation hidden in the database engine. You want to look for Nick's spatial index quadtree hilbert-curve blog.

Resources