I have a Google Sheet with a custom function formula that: takes in a matrix and two vectors from a spreadsheet, does some lengthy matrix-vector calculations (>30 sec, so above the quota), before outputting the result as a bunch of rows. It is single-threaded, since that's what Google Apps Script (GAS) natively is, but I want to parallelize the calculation using a multi-threading workaround, so it speeds it up drastically.
Requirements (1-3):
1. UX: It should run the calculations automatically and reactively as a custom function formula, so that the user doesn't have to start it manually by clicking a run button or similar, just as my single-threaded version currently does.
2. Parallelizable: It should ideally spawn ~30 threads/processes, so that instead of taking >30 seconds as it now does (which makes it time out due to Google's quota limit), it should take ~1 second. (I know GAS is single-threaded, but there are workarounds, referenced below).
3. Shareability: I should ideally be able to share the Sheet with other people, so they can "Make a copy" of it, and the script will still run the calculations for them:
3.1 Permissionless: Without me having to manually hand out individual permissions to users (permissionless). For instance whenever someone "Makes a copy" and "Execute the app as user accessing the web app". My rudimentary testing suggests that this is possible.
3.2 Non-intrusive: Without users of the spreadsheet having to give intrusive authorizations like "Give this spreadsheet/script/app access to your entire Google Drive or Gmail account?". Users having to give a non-intrusive authorization to a script/webapp can be acceptable, as long as requirement 3.1 is still maintained.
3.3 UX: Without forcing users to view an HTML sidebar in the spreadsheet.
I have already read this excellent related answer by @TheMaster, which outlines some potential ways of solving parallelization in Google Apps Script in general. Workaround #3 (google.script.run) and workaround #4 (UrlFetchApp.fetchAll), both using a Google Web App, look most promising. But some details are unknown to me, such as whether they can adhere to requirements 1 and 3 with their sub-requirements.
I can conceive of another potential naïve workaround which would be to split the function up into several custom functions formulas and do the parallelization (by some kind of Map/Reduce) inside the spreadsheet itself (storing intermediary results back into the spreadsheet, and having custom function formulas work on that as reducers). But that's undesired, and probably unfeasible, in my case.
I'm very confident my function is parallelizable using some kind of Map/Reduce process. The function is currently optimized by doing all the calculations in-memory, without touching the spreadsheet in-between steps, before finally outputting the result to the spreadsheet. The details of it are quite intricate and well over 100 lines, so I don't want to overload you with more (and potentially confusing) information which doesn't really affect the general applicability of this case. For the context of this question you may assume that my function is parallelizable (and map-reduce'able), or consider any function you already know that would be. What's interesting is what's generally possible to achieve with parallelization in Google Apps Script, while also maintaining the highest level of shareability and UX. I'll update this question with more details if needed.
Update 2020-06-19:
To be more clear, I do not rule out Google Web App workarounds entirely, as I haven't got experience with their practical limitations to know for sure if they can solve the problem within the requirements. I have updated the sub-requirements 3.1 and 3.2 to reflect this. I also added sub-req 3.3, to be clearer on the intent. I also removed req 4, since it was largely overlapping with req 1.
I also edited the question and removed the related sub-questions, so it is more focused on the single main HOWTO-question in the title. The requirements in my question should provide a clear objective standard for which answers would be considered best.
I realise the question might entail a search for the Holy Grail of Google Sheet multithreading workarounds, as @TheMaster has pointed out in private. Ideally, Google would provide one or more features to support multithreading, map-reduce, or more permissionless sharing. But until then I would really like to know what is the optimal workaround within the current constraints we have. I would hope this question is relevant to others as well, even considering the tight requirements.
If you publish a web-app with "anyone, even anonymous", execute as "Me", then the custom function can use UrlFetchApp.fetchAll (authorization not needed) to post to that web-app. This will run in parallel (proof). This solves all three requirements.
The caveat here is: if multiple people use the sheet, and the custom function has to post to the "same" web-app (that you published to execute as you) for processing, Google will limit simultaneous executions (quota limit: 30).
To work around this, you can ask people using your sheet to publish their own web-apps. They'll have to do this once at the beginning, and no authorization is needed.
If not, you'll need to host a custom server for the load, or something like google-cloud-functions might help.
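A minimal sketch of that setup, assuming a placeholder deployment URL (WEB_APP_URL), a hypothetical custom function name (PARALLEL_CALC), and row sums standing in for the real calculation. Only UrlFetchApp.fetchAll, doPost and ContentService are actual Apps Script names; everything else is made up for illustration:

```javascript
// Placeholder URL of a web app deployed as "execute as: Me", "anyone, even anonymous".
const WEB_APP_URL = "https://script.google.com/macros/s/DEPLOYMENT_ID/exec";

// Builds one POST request per chunk of work.
function buildRequests(url, chunks) {
  return chunks.map(chunk => ({
    url: url,
    method: "post",
    contentType: "application/json",
    payload: JSON.stringify(chunk),
    muteHttpExceptions: true
  }));
}

// Custom function (hypothetical name): splits the input rows into chunks,
// posts all chunks in one fetchAll call, then joins the partial results in order.
function PARALLEL_CALC(matrix, chunkCount) {
  const size = Math.ceil(matrix.length / chunkCount);
  const chunks = [];
  for (let i = 0; i < matrix.length; i += size) {
    chunks.push({ rows: matrix.slice(i, i + size) });
  }
  const responses = UrlFetchApp.fetchAll(buildRequests(WEB_APP_URL, chunks));
  return responses.flatMap(r => JSON.parse(r.getContentText()).rows);
}

// Web app side: placeholder work (one row sum per input row).
function doPost(e) {
  const rows = JSON.parse(e.postData.contents).rows
    .map(row => [row.reduce((a, b) => a + b, 0)]);
  return ContentService.createTextOutput(JSON.stringify({ rows: rows }))
    .setMimeType(ContentService.MimeType.JSON);
}
```

In a cell this would be used as, say, =PARALLEL_CALC(A1:C300, 30). Each request counts against the simultaneous-execution quota of whoever published the web app, which is the caveat described above.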
I ended up using the naïve workaround that I mentioned in my post:
I can conceive of another potential naïve workaround which would be
to split the function up into several custom functions formulas and do
the parallelization (by some kind of Map/Reduce) inside the
spreadsheet itself (storing intermediary results back into the
spreadsheet, and having custom function formulas work on that as
reducers). But that's undesired, and probably unfeasible, in my case.
I initially disregarded it because it involves having an extra sheet tab with calculations, which was not ideal. But when I reflected on it after investigating alternative solutions, I realised it actually solves all the stated requirements in the most non-intrusive manner, since it doesn't require anything extra from the users the spreadsheet is shared with. It also stays 'within' Google Sheets as far as possible (no semi- or fully external Web App needed), doing the parallelization by relying on the native parallelization of concurrently executing spreadsheet cells, where results can be chained, and appear to the user like regular formulas (no extra menu item or run-this-script buttons necessary).
So I implemented MapReduce in Google Sheets using custom functions each operating on a slice of the interval I wanted to calculate. The reason I was able to do that, in my case, was that the input to my calculation was divisible into intervals that could each be calculated separately, and then joined later.**
Each parallel custom function then takes in one interval, calculates that, and outputs the results back to the sheet (I recommend outputting as rows instead of columns, since a sheet is capped at 18 278 columns. See this excellent post on Google Spreadsheet limitations.) I did run into the limit of 40,000 new rows at a time, but was able to perform some reducing on each interval, so that they only output a very limited number of rows to the spreadsheet. That was the parallelization; the Map part of MapReduce. Then I had a separate custom function which did the Reduce part, namely: dynamically target*** the spreadsheet output area of the separately calculated custom functions, take in their results once available, and join them together while further reducing them (to find the best performing results), to return the final result.
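To make the Map and Reduce roles concrete, here is a minimal sketch. It is not my actual function (which is well over 100 lines): score is a hypothetical stand-in that sums the weights of the bits set in a combination, and the names MAP_INTERVAL and REDUCE_INTERVALS are made up for the example.

```javascript
// Hypothetical stand-in for the real per-combination calculation:
// sums the weights whose bit is set in `combo`.
function score(combo, weights) {
  let total = 0;
  for (let bit = 0; bit < weights.length; bit++) {
    if (combo & (1 << bit)) total += weights[bit];
  }
  return total;
}

// Keeps the `keep` best [combo, score] rows, highest score first.
function best(rows, keep) {
  return rows.slice().sort((a, b) => b[1] - a[1]).slice(0, keep);
}

// Map: one custom function call per interval [start, end].
// It already reduces its own output so only a few rows reach the sheet.
function MAP_INTERVAL(start, end, weights, keep) {
  const flat = [].concat(...weights); // a sheet range arrives as a 2D array
  const rows = [];
  for (let combo = start; combo <= end; combo++) {
    rows.push([combo, score(combo, flat)]);
  }
  return best(rows, keep);
}

// Reduce: joins the output ranges of all intervals, skipping blank rows.
function REDUCE_INTERVALS(keep, ...intervalOutputs) {
  const rows = [].concat(...intervalOutputs).filter(row => row[0] !== "");
  return best(rows, keep);
}
```

In the sheet, each interval gets its own cell such as =MAP_INTERVAL(1, 4, $A$2:$A$4, 2), and the reducer cell takes the output ranges of those cells as arguments. Because each Map cell is an independent custom function call, the sheet can evaluate them concurrently.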
The interesting part was that I thought I would hit the quota limit of 30 simultaneous executions in Google Sheets. But I was able to parallelize up to 64 independently and seemingly concurrently executing custom functions. It may be that Google puts these into a queue if they exceed 30 concurrent executions, and only actually processes 30 of them at any given time (please comment if you know). But anyhow, the parallelization benefit/speedup was huge, and seemingly nearly infinitely scalable. But with some caveats:
You have to define the number of parallelised custom functions up front, manually. So the parallelization doesn't infinitely auto-scale according to demand****. This is important because of the counter-intuitive result that in some cases using less parallelization actually executes faster. In my case, the result set from a very small interval could be exceedingly large, while if the interval had been larger then a lot of the results would have been ruled out underway in the algorithm in that parallelised custom function (i.e. the Map also did some reduction).
In rare cases (with huge inputs), the Reducer function will output a result before all of the parallel (Map) functions have completed (since some of them seemingly take too long). So you seemingly have a complete result set, but then a few seconds later it will re-update when the last parallel function returns its result. This is not ideal, so to be notified of this I implemented a function to tell me if the result was valid. I put it in the cell above the Reduce function (and colored the text red). B6 is the number of intervals (here 4), and the other cell references go to the cell with the custom function for each interval:
=didAnyExecutedIntervalFail($B$6,S13,AB13,AK13,AT13)
function didAnyExecutedIntervalFail(intervalsExecuted, ...intervalOutputs) {
  // Error strings a cell can show when a custom function fails or times out.
  const errorValues = new Set(["#NULL!", "#DIV/0!", "#VALUE!", "#REF!", "#NAME?", "#NUM!", "#N/A", "#ERROR!", "#"]);
  // We go through only the outputs for intervals which were included in the parallel execution.
  for (let i = 0; i < intervalsExecuted; i++) {
    if (errorValues.has(intervalOutputs[i])) {
      return "Result below is not valid (due to errors in one or more of the intervals), even though it looks like a proper result!";
    }
  }
  // No error found: nothing is returned, so the cell stays empty.
}
The parallel custom functions are limited by Google's quota of max 30 sec execution time for any custom function. So if they take too long to calculate, they still might time out (causing the issue mentioned in the previous point). The way to alleviate this timeout is to parallelise more, dividing into more intervals, so that each parallel custom function runs below 30 seconds.
The output of it all is limited by Google Sheets limitations, specifically the maximum of 5M cells in a spreadsheet. So you may need to perform some reduction on the size of the results calculated in each parallel custom function before returning its result to the spreadsheet, so that each stays below 40 000 rows (otherwise you'll receive the dreaded "Results too large" error). Furthermore, depending on the size of the result of each parallel custom function, it would also limit how many custom functions you could have at the same time, as they and their result cells take space in the spreadsheet. But if each of them takes in total, say, 50 cells (including a very small output), then you could still parallelize quite a lot (5M / 50 = 100 000 parallel functions) within a single sheet. But you also need some space for whatever you want to do with those results. And the 5M cells limit is for the whole spreadsheet in total, not just for one of its sheet tabs, apparently.
** For those interested: I basically wanted to calculate all combinations of a sequence of bits (by brute force), so the number of combinations was 2^n, where n was the number of bits. The initial range of combinations was from 1 to 2^n, so it could be divided into intervals of combinations. For example, if dividing into two intervals, it would be one from 1 to X and then one from X+1 to 2^n.
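The interval split itself can be sketched as below. The function name intervalBounds and the choice of equal-sized intervals (the last one possibly shorter) are assumptions for illustration, not necessarily how my own code divides the range:

```javascript
// Returns [start, end] (inclusive) for interval number `intervalNr` (1-based)
// when the range 1..2^nBits is divided into `intervals` equal-sized parts.
// Returns null if that interval would be empty.
function intervalBounds(nBits, intervals, intervalNr) {
  const total = Math.pow(2, nBits);
  const size = Math.ceil(total / intervals);
  const start = (intervalNr - 1) * size + 1;
  const end = Math.min(intervalNr * size, total);
  return start > end ? null : [start, end];
}
```

With 4 bits and two intervals this gives 1 to 8 and 9 to 16, so X in the example above would be 8.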
*** For those interested: I used a separate sheet formula to dynamically determine the range for the output of one of the intervals, based on the presence of rows with content. It was in a separate cell for each interval. For the first interval it was in cell S11 and the formula looked like this:
=ADDRESS(ROW(S13),COLUMN(S13),4)&":"&ADDRESS(COUNTA(S13:S)+ROWS(S1:S12),COLUMN(Z13),4)
It would output S13:Z15, which is the dynamically calculated output range. It only counts those rows with content (using COUNTA(S13:S)), thus avoiding a statically determined range. With a normal static range the size of the output would have to be known in advance, which it wasn't, so the range would possibly either not include all of the output, or include a lot of empty rows (and you don't want the Reducer to iterate over a lot of essentially empty data structures). Then I would input that range into the Reduce function by using INDIRECT(S$11). So that's how you get the results, from one of the intervals processed by a parallelized custom function, into the main Reducer function.
**** Though you could make it auto-scale up to some pre-defined amount of parallelised custom functions. You could use some preconfigured thresholds, and divide into, say, 16 intervals in some cases, but in other cases automatically divide into 64 intervals (preconfigured, based on experience). You'd then just stop / short-circuit the custom functions which shouldn't participate, based on if the number of that parallelised custom function exceeds the number of intervals you want to divide into and process. On the first line in the parallelised custom function: if (calcIntervalNr > intervals) return;. Though you would have to set up all the parallel custom functions in advance, which can be tedious (remember you have to account for the output area of each, and are limited by the max cell limit of 5M cells in Google Sheets).
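A minimal sketch of that short-circuit, assuming a made-up threshold of 20 bits and the hypothetical names chooseIntervals and CALC_INTERVAL. The returned string is only a placeholder for the real per-interval work:

```javascript
// Hypothetical preconfigured threshold: small inputs use 16 intervals, large ones 64.
function chooseIntervals(nBits) {
  return nBits <= 20 ? 16 : 64;
}

// One of the pre-placed parallel custom functions; `calcIntervalNr` is its fixed number.
function CALC_INTERVAL(calcIntervalNr, nBits) {
  const intervals = chooseIntervals(nBits);
  // Short-circuit: this cell does not take part in the current run.
  if (calcIntervalNr > intervals) return;
  // Placeholder for the real calculation of this interval.
  return "interval " + calcIntervalNr + " of " + intervals;
}
```

Cells numbered 17 to 64 then stay empty for small inputs and only start calculating once the input is large enough to use 64 intervals.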
I'm a mechanical engineer, and I have developed a pretty cool spreadsheet that I use to size some steel members for lifting beams. The setback is that I need to do some trial and error in the selection of the member until I get one that gets as close to the allowable limits as possible.
What I'm hoping to improve on is to develop a function that based upon a length and weight variable that I enter, the program runs a loop and automatically selects the best member size(s) based upon a list of the members and their physical properties. Is this possible?
Yes. Depending on the complexity, a simple search through the parameters (less than, more than, etc.) might bring you the answer. You can do it quite easily via the Pandas library. Just load the Excel file as a pandas DataFrame (pandas.read_excel()), which will then allow you to perform the searches on that DataFrame object.
If you want to run an optimization algorithm, you should look into SciPy's optimize module to get what you're looking for based on the input data (it handles unconstrained and constrained functions).
Of course, the question you've stated is quite general, so I only pointed the direction. More info would be better.
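The "simple search" idea does not depend on pandas. As a rough sketch in plain JavaScript, assuming each member is an object with a name and that you supply your own utilization function (demand divided by allowable, so 1.0 means exactly at the limit); the names selectBestMembers, capacity and utilization are hypothetical:

```javascript
// Returns the members that stay within `limit`, closest to the limit first.
// `utilization` is a caller-supplied function: member -> demand / allowable.
function selectBestMembers(members, utilization, limit, count) {
  return members
    .map(m => ({ name: m.name, utilization: utilization(m) }))
    .filter(r => r.utilization <= limit)
    .sort((a, b) => b.utilization - a.utilization)
    .slice(0, count);
}

// Example with made-up numbers: a demand of 15 against each member's capacity.
const members = [
  { name: "A", capacity: 10 },
  { name: "B", capacity: 20 },
  { name: "C", capacity: 40 }
];
const picks = selectBestMembers(members, m => 15 / m.capacity, 1.0, 2);
// picks lists B (0.75) first, then C (0.375); A exceeds the limit and is dropped.
```

The real utilization check would come from the formulas already in your spreadsheet, evaluated for the length and weight you enter.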
I'm developing an add-in for Excel using the Office Add-ins platform. In this add-in I'm writing data to a range using the setSelectedDataAsync** function. It works fine, but after the data is written, I'm not able to delete or edit the cells (although I can select new ranges) unless I click anywhere outside the worksheet or double click a cell. I think it is an issue with Excel not regaining focus correctly (the filename in the top of the app remains grayed out).
Some users seem to think that Excel becomes unresponsive, which is a problem.
Is this a known issue? Is there a work around for this?
** I have noticed that setSelectedDataAsync is way, way quicker than setting range.values to a matrix and then ctx.sync(). Am I losing some important functionality by not using the latter method?
This is not a known issue (unable to interact with worksheet after setting the data). We can look into that.
Surprised to hear that setSelectedDataAsync works faster than the range.values set. The batched syntax allows you to combine not just one instruction, but many related instructions such as setting number format, font, background, etc. and you can do a single sync() to send all instructions in one batch. So, it is more efficient when you combine related instructions together.
There is no restriction of which API to use as such; however the Excel1.1 version was introduced with Office 2016 and then there have been many releases since then incrementally adding new features along the way.
The setSelectedDataAsync() API was designed to work across hosts such as Excel, Word, etc., and hence doesn't go deeper in terms of setting number formats, cell formats, etc.
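As a rough sketch of the batched syntax described above: the helper name writeFormattedBlock, the address and the formatting choices are assumptions for illustration. Values, number format, font and fill are all queued on the range and sent with a single sync():

```javascript
// Queues several related instructions on one range, then sends them in one batch.
// `context` is the request context that Excel.run passes to its callback.
async function writeFormattedBlock(context, address, values, numberFormat) {
  const sheet = context.workbook.worksheets.getActiveWorksheet();
  const range = sheet.getRange(address);
  range.values = values;
  // numberFormat must have the same dimensions as the range.
  range.numberFormat = values.map(row => row.map(() => numberFormat));
  range.format.font.bold = true;
  range.format.fill.color = "#FFF2CC";
  await context.sync(); // one round trip for all of the above
}

// Inside an add-in this would be called as:
// Excel.run(context => writeFormattedBlock(context, "A1:B2", [[1, 2], [3, 4]], "0.00"));
```

If you only ever write raw values, this batching brings less benefit, which may be part of why setSelectedDataAsync appears faster in your case.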
How is it possible to switch to another worksheet while the results or computations of the current worksheet are being fetched?
Currently we have to wait for the current computation to complete before moving to another sheet. Ideally I would like to know how can I push the current worksheet job in the background and resume working on another worksheet in Tableau?
Note: Extracts are not an option, nor are multiple instances, as the source data has millions of records.
I don't think this is functionality Tableau offers right now, but here are a few things that might speed up development:
If possible, extract your data, rather than using a live connection. I know this isn't always an option, but it can remove a lot of overhead, particularly during development.
Optimize your extracts. Not having to recompute all of your calculated fields every time you make a query can make for some pretty notable speed boosts.
By far the best way to minimize your load times is to subset your data while you're building the worksheets. During development, it might not be necessary to load every row of your dataset. In more cases than not, a subset will be enough to confirm that your worksheets and calculations are working as you need them to. You could try filtering to just a month's worth of data, for example, or maybe just a handful of individuals/stores/dog breeds/Skrillex songs.
Nadir's suggestion to pause auto-updates is a great one, but if you're building more complicated views or more intricate calculations, not having real-time feedback on your work can make development a lot more challenging. However, if subsetting your data isn't an option, this might be your best route. One way to speed this process up a bit would be to toss all of the sheets you want to load onto a dashboard and resume auto updates from there. Note that this dashboard would not be a formatted production dashboard — it's simply serving as a drop point for the sheets you'd like to load all at once.
I've never found a way to achieve this. But it is worth noting that you can have multiple Tableau workbooks open at the same time and they run in separate threads.
So, if you can split your work across workbooks, then you can switch from one to the other during calculations quite easily without the calculation impeding your work. You may be able to merge workbooks later if you organise the split of your work the right way.
If you are using a Live Data Source, you can pause auto-updates, until you are ready to see the results/computations. I know this does not completely help with what you are trying to do but it does give you a chance to go through and create the worksheets you need and then have it all load at once.
I'm building a network business model in Excel. A similar model is that of Gawker Media.
In my model I have a number of properties that have some overlap of audience. Each property attracts users, which in turn affords cross-promotional opportunities. In the case of Gawker, they have a series of blogs whose audience will likely read several of the blogs in their network.
If Gawker launched a new blog, they would be able to direct traffic to it from their blog network.
Creating a model for a single blog is fairly simple - although the initial assumptions are harder. The next step is to model the network effect.
Excel provides a scenarios manager that allows me to vary the key assumptions in the basic model. This is almost perfect, I can model the launch of 10 properties, each with different launch assumptions and see the summary.
Where I need help is figuring out how I can vary the initial number of users for the launch of each property. In other words, once the network is established, it's possible to drive people to any new property launched on the network.
I don't believe the scenario manager will do what I need.
So, I'm wondering if it's possible to use the model worksheet as a UDF? The UDF would need to spit out the monthly revenue and unique users given a number of input assumptions.
I would then be able to create my own summary sheet for the 10 properties and using the total uniques for each property get a summary for the network. This network summary would be used to determine how many people could be driven to the launch of a new property.
In effect, the only difference to the scenario manager is that I need one of my input variables (initial users) to be programmatically generated as a function of the number of people in the network at the time of launch.
I'm hoping it's possible to achieve something along these lines in Excel. I could drop down and create the whole model in Java, but then it's much harder to share with business colleagues!
Thanks - Matt.
You could try Data Table.
It only allows you to analyse the effect of varying 2 input parameters, but you can create several data tables, and each parameter can take hundreds of different values.
It's little known, but efficient, and has been available since Excel 3.0.
There is a product that I have researched but never used - search for calc4web. It takes a sheet of formulas and generates code (C++) that can be compiled into an XLL add-in. Then you can call a function that does what your sheet does. But of course then you have an XLL to distribute, and a build step every time you change your logic, which defeats much of the point of using a spreadsheet.
In my case, I wound up writing some very simple VBA code to vary my sheet "inputs" using the scenario manager, and capture my "outputs". This works if you have a batch of inputs that you can just point your macro at and step through.
EDIT:
See here for a VBA-only example of doing this:
using a sheet in an excel user defined function