Alternative to SUMPRODUCT - Excel

Does anyone know a more efficient way to write this? Excel keeps giving me "not enough system resources" errors because I have about 30 slightly varied versions of this formula in my spreadsheet.
=SUMPRODUCT((Data!B2:B1000="Human Resources")*(Data!E2:E1000<>"Resolved")*(Data!E2:E1000<>"Closed")*(Data!E2:E1000<>"Cancelled"))
I have looked into COUNTIFS but I can't seem to get it to work.

COUNTIFS should do the job, like this (prefix the ranges with Data! if the formula sits on a different sheet from the data):
=COUNTIFS(B:B,"Human Resources",E:E,"<>Resolved",E:E,"<>Closed",E:E,"<>Cancelled")

I think what you are after is something like this (note the spacing is exaggerated between the ranges for emphasis, remove for actual use):
=SUMPRODUCT((Data!B2:B1000="Human Resources") * (Data!E2:E1000<>"Closed") , (Data!B2:B1000="Human Resources") * (Data!E2:E1000<>"Cancelled" ) , (Data!B2:B1000="Human Resources") * (Data!E2:E1000<>"Resolved" ))
Test Data:
Human Resources Resolved
Human Resources Closed
Human Resources Cancelled
Human Resources Open
Human Resources Yellow
Human Resources Duck
Human Resources Rock
Human Resources Resolved
Human Resources Closed
Human Resources Cancelled
Parks and Rec 3
Expected: 4
Result: 4
I tried the same formula against a larger local data set, changing the references to cover a random sample of 1000 rows, and there was no noticeable slowdown or warning. It is by no means benchmarked or optimized, but the intent stays pretty clear.
Larger data sets might cause your machine to run out of available indexes or raw memory. If you suspect that is the case, partition the data set and process one chunk at a time (perhaps start by splitting it in half and attempting 500 rows at once), then sum the chunk results once all chunks are done.
One last note, for posterity, SUMPRODUCT is fairly version dependent. If you run into problems, you can change it to a strictly SUM formula, with just a bit more work.
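For anyone who wants to sanity-check the expected count outside Excel, here is a minimal Python sketch of the same logic, including the chunk-and-sum idea. The row list is copied from the test data above; the function names and the chunk size are just illustrative, and the comparison is case-sensitive, unlike Excel's.

```python
# Rows copied from the test data above: (department, status).
ROWS = [
    ("Human Resources", "Resolved"),
    ("Human Resources", "Closed"),
    ("Human Resources", "Cancelled"),
    ("Human Resources", "Open"),
    ("Human Resources", "Yellow"),
    ("Human Resources", "Duck"),
    ("Human Resources", "Rock"),
    ("Human Resources", "Resolved"),
    ("Human Resources", "Closed"),
    ("Human Resources", "Cancelled"),
    ("Parks and Rec", "3"),
]

# Statuses that must NOT be counted (the three <> conditions in the formula).
EXCLUDED = {"Resolved", "Closed", "Cancelled"}


def count_open(rows, department="Human Resources"):
    """Count rows for `department` whose status is not an excluded one."""
    return sum(1 for dept, status in rows
               if dept == department and status not in EXCLUDED)


def count_open_chunked(rows, chunk_size):
    """Same count, computed one chunk at a time and summed afterwards."""
    return sum(count_open(rows[i:i + chunk_size])
               for i in range(0, len(rows), chunk_size))


if __name__ == "__main__":
    print(count_open(ROWS))             # whole range at once
    print(count_open_chunked(ROWS, 4))  # partitioned into chunks of 4 rows
```

Both calls should agree with the "Expected: 4" line, since the count is a plain sum and splitting the rows into chunks cannot change it.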


How to parallelize execution of a custom function formula while keeping the Google Sheet shareable and permissionless?

I have a Google Sheet with a custom function formula that takes in a matrix and two vectors from a spreadsheet and does some lengthy matrix-vector calculations (>30 sec, so above the quota) before outputting the result as a bunch of rows. It is single-threaded, since that's what Google Apps Script (GAS) natively is, but I want to parallelize the calculation using a multi-threading workaround to speed it up drastically.
Requirements (1-3):
1. UX: It should run the calculations automatically and reactively as a custom function formula, so the user doesn't have to start it manually by clicking a run button or similar, just as my single-threaded version currently does.
2. Parallelizable: It should ideally spawn ~30 threads/processes, so that instead of taking >30 seconds as it does now (which makes it time out due to Google's quota limit), it takes ~1 second. (I know GAS is single-threaded, but there are workarounds, referenced below.)
3. Shareability: I should ideally be able to share the Sheet with other people, so they can "Make a copy" of it, and the script will still run the calculations for them:
3.1 Permissionless: Without me having to manually hand out individual permissions to users. For instance, whenever someone "Makes a copy" and uses "Execute the app as user accessing the web app". My rudimentary testing suggests that this is possible.
3.2 Non-intrusive: Without users of the spreadsheet having to give intrusive authorizations like "Give this spreadsheet/script/app access to your entire Google Drive or Gmail account?". Users having to give a non-intrusive authorization to a script/webapp can be acceptable, as long as requirement 3.1 is still maintained.
3.3 UX: Without forcing users to view an HTML sidebar in the spreadsheet.
I have already read this excellent related answer by #TheMaster which outlines some potential ways of solving parallelization in Google Apps Script in general. Workaround #3 google.script.run and workaround #4 UrlFetchApp.fetchAll (both using a Google Web App) look most promising. But some details are unknown to me, such as whether they can adhere to requirements 1 and 3 with its sub-requirements.
I can conceive of another potential naïve workaround, which would be to split the function up into several custom function formulas and do the parallelization (by some kind of Map/Reduce) inside the spreadsheet itself (storing intermediary results back into the spreadsheet, and having custom function formulas work on that as reducers). But that's undesired, and probably infeasible, in my case.
I'm very confident my function is parallelizable using some kind of Map/Reduce process. The function is currently optimized by doing all the calculations in-memory, without touching the spreadsheet between steps, before finally outputting the result to the spreadsheet. The details are quite intricate and run well over 100 lines, so I don't want to overload you with more (and potentially confusing) information which doesn't really affect the general applicability of this case. For the context of this question you may assume that my function is parallelizable (and map-reduce'able), or consider any function you already know that would be. What's interesting is what's generally possible to achieve with parallelization in Google Apps Script, while also maintaining the highest level of shareability and UX. I'll update this question with more details if needed.
Update 2020-06-19:
To be more clear, I do not rule out Google Web App workarounds entirely, as I haven't got experience with their practical limitations to know for sure if they can solve the problem within the requirements. I have updated the sub-requirements 3.1 and 3.2 to reflect this. I also added sub-req 3.3, to be clearer on the intent. I also removed req 4, since it was largely overlapping with req 1.
I also edited the question and removed the related sub-questions, so it is more focused on the single main HOWTO-question in the title. The requirements in my question should provide a clear objective standard for which answers would be considered best.
I realise the question might entail a search for the Holy Grail of Google Sheet multithreading workarounds, as #TheMaster has pointed out in private. Ideally, Google would provide one or more features to support multithreading, map-reduce, or more permissionless sharing. But until then I would really like to know what is the optimal workaround within the current constraints we have. I would hope this question is relevant to others as well, even considering the tight requirements.
If you publish a web-app with "anyone, even anonymous", execute as "Me", then the custom function can use UrlFetchApp.fetchAll (authorization not needed) to post to that web-app. This will run in parallel (proof). This solves all three requirements.
The caveat here is: if multiple people use the sheet, the custom function will have to post to the "same" web-app (the one you published to execute as you) for processing, and Google will limit simultaneous executions (quota limit: 30).
To work around this, you can ask people using your sheet to publish their own web-apps. They'll have to do this once at the beginning, and no authorization is needed.
If not, you'll need to host a custom server for the load, or something like google-cloud-functions might help.
I ended up using the naïve workaround that I mentioned in my post:
I can conceive of another potential naïve workaround, which would be
to split the function up into several custom function formulas and do
the parallelization (by some kind of Map/Reduce) inside the
spreadsheet itself (storing intermediary results back into the
spreadsheet, and having custom function formulas work on that as
reducers). But that's undesired, and probably infeasible, in my case.
I initially disregarded it because it involves having an extra sheet tab with calculations, which was not ideal. But when I reflected on it after investigating alternative solutions, I found it actually solves all the stated requirements in the most non-intrusive manner, since it doesn't require anything extra from the users the spreadsheet is shared with. It also stays 'within' Google Sheets as far as possible (no semi- or fully external Web App needed), doing the parallelization by relying on the native parallelization of concurrently executing spreadsheet cells, where results can be chained and appear to the user like regular formulas (no extra menu item or run-this-script buttons necessary).
So I implemented MapReduce in Google Sheets using custom functions each operating on a slice of the interval I wanted to calculate. The reason I was able to do that, in my case, was that the input to my calculation was divisible into intervals that could each be calculated separately, and then joined later.**
Each parallel custom function then takes in one interval, calculates it, and outputs the results back to the sheet. (I recommend outputting as rows instead of columns, since columns are capped at 18,278. See this excellent post on Google Spreadsheet limitations.) I did run into the limit of only 40,000 new rows at a time, but was able to perform some reducing on each interval, so that each only outputs a very limited number of rows to the spreadsheet. That was the parallelization: the Map part of MapReduce. Then I had a separate custom function which did the Reduce part, namely: dynamically target*** the spreadsheet output areas of the separately calculated custom functions, take in their results once available, and join them together while further reducing them (to find the best performing results), to return the final result.
The interesting part was that I thought I would hit the quota limit of only 30 simultaneous executions in Google Sheets. But I was able to parallelize up to 64 independently and seemingly concurrently executing custom functions. It may be that Google puts these into a queue if they exceed 30 concurrent executions, and only actually processes 30 of them at any given time (please comment if you know). Either way, the parallelization speedup was huge, and seemingly nearly infinitely scalable, but with some caveats:
You have to define the number of parallelised custom functions up front, manually. So the parallelization doesn't infinitely auto-scale according to demand****. This is important because of the counter-intuitive result that in some cases using less parallelization actually executes faster. In my case, the result set from a very small interval could be exceedingly large, whereas with a larger interval a lot of the results would have been ruled out along the way by the algorithm in that parallelised custom function (i.e. the Map also did some reduction).
In rare cases (with huge inputs), the Reducer function will output a result before all of the parallel (Map) functions have completed (since some of them seemingly take too long). So you seemingly have a complete result set, but then a few seconds later it will re-update when the last parallel function returns its result. This is not ideal, so to be notified of this I implemented a function to tell me if the result was valid. I put it in the cell above the Reduce function (and colored the text red). B6 is the number of intervals (here 4), and the other cell references go to the cell with the custom function for each interval: =didAnyExecutedIntervalFail($B$6,S13,AB13,AK13,AT13)
// Returns a warning text if any executed interval produced an error value.
// Returns nothing (leaving the cell blank) when all checked outputs are fine.
function didAnyExecutedIntervalFail(intervalsExecuted, ...intervalOutputs) {
  const errorValues = new Set(["#NULL!", "#DIV/0!", "#VALUE!", "#REF!", "#NAME?", "#NUM!", "#N/A", "#ERROR!", "#"]);
  // We go through only the outputs for intervals which were included in the parallel execution.
  for (let i = 0; i < intervalsExecuted; i++) {
    if (errorValues.has(intervalOutputs[i]))
      return "Result below is not valid (due to errors in one or more of the intervals), even though it looks like a proper result!";
  }
}
The parallel custom functions are limited by the Google quota of max 30 sec execution time for any custom function. So if they take too long to calculate, they still might time out (causing the issue mentioned in the previous point). The way to alleviate this timeout is to parallelise more, dividing into more intervals, so that each parallel custom function runs in under 30 seconds.
The output of it all is limited by Google Sheets limitations, specifically a maximum of 5M cells in a spreadsheet. So you may need to reduce the size of the results calculated in each parallel custom function before returning them to the spreadsheet, so that each is below 40,000 rows (otherwise you'll receive the dreaded "Results too large" error). Furthermore, the size of the result of each parallel custom function also limits how many custom functions you can have at the same time, as they and their result cells take up space in the spreadsheet. But if each of them takes, say, 50 cells in total (including a very small output), then you could still parallelize quite a lot (5M / 50 = 100,000 parallel functions) within a single sheet. But you also need some space for whatever you want to do with those results. And the 5M cells limit is apparently for the whole spreadsheet in total, not just for one of its sheet tabs.
** For those interested: I basically wanted to calculate all combinations of a sequence of bits (by brute force), so the number of combinations was 2^n, where n was the number of bits. The initial range of combinations was from 1 to 2^n, so it could be divided into intervals of combinations. For example, if dividing into two intervals, there would be one from 1 to X and then one from X+1 to 2^n.
*** For those interested: I used a separate sheet formula to dynamically determine the range for the output of one of the intervals, based on the presence of rows with content. It was in a separate cell for each interval. For the first interval it was in cell S11 and the formula looked like this:
=ADDRESS(ROW(S13),COLUMN(S13),4)&":"&ADDRESS(COUNTA(S13:S)+ROWS(S1:S12),COLUMN(Z13),4)
It would output S13:Z15, which is the dynamically calculated output range. It only counts rows with content (using COUNTA(S13:S)), thus avoiding a statically determined range. With a normal static range the size of the output would have to be known in advance, which it wasn't; otherwise the range would either miss some of the output or include a lot of empty rows (and you don't want the Reducer to iterate over a lot of essentially empty data structures). Then I would input that range into the Reduce function by using INDIRECT(S$11). So that's how you get the results from one of the intervals processed by a parallelized custom function into the main Reducer function.
**** Though you could make it auto-scale up to some pre-defined amount of parallelised custom functions. You could use some preconfigured thresholds, and divide into, say, 16 intervals in some cases, but in other cases automatically divide into 64 intervals (preconfigured, based on experience). You'd then just stop / short-circuit the custom functions which shouldn't participate, based on if the number of that parallelised custom function exceeds the number of intervals you want to divide into and process. On the first line in the parallelised custom function: if (calcIntervalNr > intervals) return;. Though you would have to set up all the parallel custom functions in advance, which can be tedious (remember you have to account for the output area of each, and are limited by the max cell limit of 5M cells in Google Sheets).
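To make the interval idea from the ** footnote concrete, here is a rough Python sketch of the Map/Reduce shape (not the actual Apps Script code, which isn't shown here). The names split_into_intervals, map_interval and reduce_results are made up for illustration, and the scoring function is a placeholder assumption, since the real calculation is not described.

```python
def split_into_intervals(n_bits, n_intervals):
    """Split the range 1..2**n_bits into contiguous, inclusive (start, end) intervals."""
    total = 2 ** n_bits
    if not 1 <= n_intervals <= total:
        raise ValueError("n_intervals must be between 1 and 2**n_bits")
    base, extra = divmod(total, n_intervals)
    intervals = []
    start = 1
    for i in range(n_intervals):
        # The first `extra` intervals get one additional combination each.
        size = base + (1 if i < extra else 0)
        intervals.append((start, start + size - 1))
        start += size
    return intervals


def map_interval(start, end, score):
    """Map step: best (score, combination) inside one interval.

    `score` is a hypothetical stand-in for the real per-combination calculation.
    """
    return max((score(c), c) for c in range(start, end + 1))


def reduce_results(partials):
    """Reduce step: join the per-interval winners into one overall result."""
    return max(partials)


if __name__ == "__main__":
    popcount = lambda c: bin(c).count("1")  # placeholder scoring function
    parts = [map_interval(s, e, popcount) for s, e in split_into_intervals(4, 2)]
    print(reduce_results(parts))
```

In the spreadsheet version each map_interval call would be its own custom function cell, and the reducer cell would read their output ranges. Here they simply run one after another.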

Need help simplifying or improving a weighted distribution formula in excel (math/excel/programming noob)

I've been having some fun creating a rather extensive inventory in Google Sheets for my collection of trading cards. I buy the majority of my collectibles in lots meaning that I pay a total of X dollars for Y number of cards of different values (as opposed to buying each card individually).
In my spreadsheet I have a "Purchase Price" column where I enter in the price I paid for each card. If I buy 1 lot of 10 cards, to find the value of each of those cards you would just divide the cost of the lot by the number of cards in the lot. So if I purchased 1 lot of 10 cards for a total of $100, the Purchase Price of each card would equal $10. Simple enough right?
Well, that would be if you were OK with entering the rare, uncommon, and common cards in the lot with having the same exact purchase price even though their real market values would all be different. So, what I did was create a formula that automatically adjusts the purchase price for each card that's part of a lot based on its rarity so it's at least closer in accuracy to the actual market value of the card.
Here is the formula:
=IFS(B2="C",D2*$B$15*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="U",D2*$B$16*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="R",D2*$B$17*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2)
Not sure if that means much to anyone, so here's a link to an example spreadsheet of the formula in action below.
And if you don't care to check that out, here's a screenshot:
The problem:
So the formula works exactly how I want it to work EXCEPT when there are 0 commons in a lot. When that happens I get a #DIV/0! error saying that "Function DIVIDE parameter 2 cannot be zero." I understand why this is happening since it doesn't like to divide by 0 in the first line, but what I don't understand is how to fix it.
How can I fix the DIV error, or is there a better way to do this, perhaps an alternative formula or approach? I am not a programmer and somewhat of a beginner at Excel.
Two suggestions.
1. Embed each division in an IFERROR() function as shown below. This function will return zero instead of an error. You can substitute another calculation for that. In fact, depending upon the level at which you introduce the IFERROR (wrapping just one of the three calculations or all three), you might choose to embed the IFS in another IFS that tests for zeroes. Once you have no more divisions by zero there would be no more need for IFERROR. So it becomes a question of formula efficiency.
=IFERROR(D2*$B$15*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,0)
2. Forget about all of this and seek a commercially logical solution. The logic says that you never buy a lot unless it contains some items you want, and the seller never has a lot that doesn't contain rubbish. In the end you get inundated with commons, meaning you have more of them than you can ever hope to sell. So, what's their real, commercial value? Value your rare and uncommon cards individually and all the baggage not at all. You will find the outcome more realistic both for the commons and the rares. BTW, that's what they do with stamp or coin collections.
I realise this is more a comment than an answer, but it's too large to put it as a comment:
Your formula is unreadable, as you can see:
=IFS(B2="C",D2*$B$15*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="U",D2*$B$16*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="R",D2*$B$17*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2)
First I'd advise you to create a new column (you might always hide it), I:I, which contains following formula (for I2):
=D2*G2/((D2*const_weigth_common)+(E2*const_weigth_uncommon)+(F2*const_weigth_rare))/D2
(And you give this a meaningful header name)
As far as names for $B$15:$B$17 are concerned, do something like:
$B$15 : const_weigth_common
$B$16 : const_weigth_uncommon
$B$17 : const_weigth_rare
(You do know how to use names in Excel?)
Like this, your formula becomes:
=IFS(B2="C",I2 * const_weigth_common,
B2="U",I2 * const_weigth_uncommon,
B2="R",I2 * const_weigth_rare)
As far as your error is concerned: as mentioned in another answer, this can be tackled by wrapping the I2 formula in =IF(), so I2 becomes (it is up to you what to use as _ERROR_VALUE; depending on your locale the argument separator is , or ;):
=IF(D2<>0,...,_ERROR_VALUE)
Like this, your formulas become much clearer and it will be easier to solve possible problems.
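For what it's worth, every branch of the original formula multiplies by D2 and then divides by D2 again, so algebraically that factor cancels and the per-card price is just weight * lot price / weighted total. Here is a small Python sketch of that simplified calculation. The function name and the example weights are made up, and the arguments stand in for cells D2, E2, F2, G2 and B15:B17.

```python
def price_per_card(rarity, commons, uncommons, rares, lot_price, weights):
    """Purchase price allocated to ONE card of the given rarity.

    commons/uncommons/rares are the card counts in the lot (D2, E2, F2),
    lot_price is the total paid (G2), and weights maps "C", "U", "R" to the
    rarity weights (B15, B16, B17 in the question).
    """
    weighted_total = (commons * weights["C"]
                      + uncommons * weights["U"]
                      + rares * weights["R"])
    if weighted_total == 0:
        return None  # empty lot: nothing to allocate
    # No multiplication or division by the number of commons is needed,
    # so a lot with 0 commons is not a special case.
    return weights[rarity] * lot_price / weighted_total


if __name__ == "__main__":
    example_weights = {"C": 1, "U": 2, "R": 5}  # hypothetical weights
    # A lot with no commons: 5 uncommons and 1 rare bought for $30.
    print(price_per_card("U", 0, 5, 1, 30, example_weights))
    print(price_per_card("R", 0, 5, 1, 30, example_weights))
```

The allocated prices still add up to the lot price (count of each rarity times its per-card price), which is a quick way to check any variant of the formula.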
I Ain't No Math-A-Magician
But I can help you with this...
The way I see it, there are three schools of thought and you need to figure out which one is yours:
The programmer - trapping #div/0 errors all day
The purist - there is only one answer, and it is undefined
The pragmatic - a graph shows results approaching a limit
I think the programmer is either dangerous or ineffectual, or dangerously ineffectual. Sure, he can trap the error, but what exactly does that do? I'll tell you exactly what that does: it puts lipstick on a pig. It's literally replacing one string, "#div/0!", with a different, more aesthetic string. Or he can play second fiddle to the devil by publicly killing the one bug everyone knew how to defend against and creating another.
I think the purist is right about one thing: the answer is what it is and it can't be anything else. But he's also wrong. A precisely known theoretical answer may very well be undefined, but where in the real world has anyone ever seen division by zero? It's a mathematical construct like infinity; we can safely ignore it. Don't believe me? What happens when you divide the sun by zero? Go ahead, I'll wait for your answer.
I think these things because I am grounded in pragmatism. One string is not better than the other if they both symbolize an attempt to divide by zero. I prefer knowing who my enemies are so I may keep them in front of me. Nature truly abhors the undefined and is only slightly displeased with a vacuum.

data and structure cleansing of excel sheets

I have over 6,000 excel sheets. While all the sheets describe the same thing, they are independently formatted. They all have between 9 and 13 columns, but they are out of order, the column names are independently misspelled, and they may or may not have a second, or third, column header.
I am currently trying in Python to read cells in a left-down-right-up motion to attempt to locate the same data, but there are simply too many differences in structure names, column ordering, and data definitions to lock them in one at a time. Is there a tool that I can use to read these documents and conform them to a single format, via a rapid mapping function?
Thanks much.
Wow, it's the Ultimate Data Horror Story.
I want to ask how you ever let it get this way... but I actually don't want to know; I'm already going to have nightmares about this.
It's like that Hoarding show on TV, but with data.
No, I'm afraid that if you can't even identify a pattern then there's no magic function that will be able to either.
But that doesn't mean it's a lost cause. It's just going to need some human interaction, and there are ways to minimize the pain.
What you need is a custom interface that will load the documents one by one, and will walk a human through clicking each relevant column or area, and then automatically load the next document.
There would also need to be buttons for sorting things like obvious garbage sheets (blanks?), "unknowns" (that get put in a folder for advanced research later), and other "unpredictables" that may come up during the process.
Also, perhaps once you get into it, you'll notice a pattern you're not thinking of, like maybe *"the person who handled the files from 2002 to 2004 set them up this way"*, or, "when Budget is misspelled, it's always either Bugdet or Budteg".
In this scope, little patterns like that can make a big difference.
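As a rough illustration of how a pattern like the misspelled "Budget" could be exploited, here is a small Python sketch that fuzzy-matches raw header text against a list of canonical column names using the standard library's difflib. The canonical names and the 0.6 cutoff are assumptions for the example, not anything taken from the actual files, and anything it cannot match is left for a human to decide.

```python
import difflib

# Hypothetical target schema; the real column names are not given in the question.
CANONICAL_HEADERS = ["budget", "department", "amount"]


def normalize_header(raw, canonical=CANONICAL_HEADERS, cutoff=0.6):
    """Map a possibly misspelled header to its canonical name, or None if unsure."""
    cleaned = raw.strip().lower()
    matches = difflib.get_close_matches(cleaned, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def map_columns(raw_headers):
    """Return ({column_index: canonical_name}, [indexes needing human review])."""
    mapping, unknown = {}, []
    for index, raw in enumerate(raw_headers):
        name = normalize_header(str(raw))
        if name is None:
            unknown.append(index)   # leave it for the person at the keyboard
        else:
            mapping[index] = name
    return mapping, unknown


if __name__ == "__main__":
    print(map_columns(["Amount", "Bugdet", "???"]))
```

A helper like this would only pre-fill the obvious columns in the interface described above. The "unknown" list is what still gets routed to a person.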
Depending on your coding skills, you may or may not need outside assistance with this. I assume this is not data that can just get thrown out, or you wouldn't be asking...
If each document took an average of 20 seconds to process, that would be about 33 hours in total. An hour a day and it's done in a month. Or put someone on it full-time, and it's done in a week.
Do you have a budget you can throw at this? Data archaeology is an actual thing! Hell, I'll do it for you for the right price... (wouldn't break the bank, depending on how urgent it is, of course!)
Either way, this ain't going to be fun for "someone"...

How can I better optimize a search in possible Fantasyland constructions in Pineapple poker?

So, a bit of explanation to preface the question. In variants of Open Face Chinese poker, you are dealt cards one at a time, which are to be placed into three different rows, and the goal is to make each row increasingly better while, of course, getting the best hands possible. A difference from normal poker is that the top row only contains three cards, so three of a kind is the best possible hand you can get there. In a variant of this called Pineapple, which is what I'm working on a bot for, you are dealt three cards at a time after the initial 5, and you discard one of those three cards each round.
Now, there's a special rule called Fantasyland, which means that if you get a pair of queens or better in the top row, and still manage to get successively better hands in the middle and bottom rows, your next round becomes a Fantasyland round. This is a round where you are dealt 15 cards at the same time, and are free to construct the best three rows possible (rows of 3, 5, and 5 cards, discarding the remaining 2). Each row yields a certain number of points (royalties, as they're called) depending on which hand is constructed, and each successive row needs better and better hands to yield the same amount of points.
Trying to optimize solutions for this seemed like a natural starting point, and one of the most interesting parts as well, so I started working on it. My first attempt, which is also where I'm stuck, was to use Simulated Annealing to do local search optimization. The energy/evaluation function is the amount of points, and at first I tried a move/neighbor function of simply swapping two cards at random, having first placed them as they were drawn. This worked decently, managing to get a mean of around 6 points per hand, which isn't bad, but I often noticed that I could spot better solutions by swapping more than one pair of cards at the same time. Thus, I changed the move/neighbor function to swap several pairs of cards at once, and also tried swapping a random number of pairs, from 1 up to a maximum of between 3 and 5, which managed to yield slightly better results, but I still often spot better solutions by simply taking a look.
If anyone is reading this and understands the problem, any idea on how to better optimize this search? Should I use a different move/neighbor function, different Annealing parameters, or perhaps a different local search method, or even some kind of non-local search? All ideas are welcome and deeply appreciated.
You haven't indicated a performance requirement, so I'll assume that this should work quickly enough to be usable in a game with human players. It can't take an hour to find the solution, but you don't need it in a millisecond, either.
I'm wondering if simulated annealing is the right method. This might be an opportunity for brute force.
One can make a very fast algorithm for evaluating poker hands. Consider an encoding of the cards where 13 bits encode the card value and 4 bits encode the suit. OR together the cards in the hand and you can quickly identify pairs, triples, straights, and flushes.
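A minimal Python sketch of that kind of encoding, assuming cards are written as two-character strings like "Ah" or "Tc" (my choice, not something from the question). Note that OR-ing alone only tells you how many distinct ranks a hand has, so it reveals that a hand contains duplicates; telling a pair from three of a kind needs extra bookkeeping, which this sketch leaves out.

```python
RANKS = "23456789TJQKA"   # bit 0 = deuce ... bit 12 = ace
SUITS = "cdhs"            # bit 0 = clubs ... bit 3 = spades
WHEEL = (1 << 12) | 0b1111  # A-2-3-4-5, the ace-low straight


def masks(hand):
    """OR together the rank bits and the suit bits of all cards in the hand."""
    rank_bits = suit_bits = 0
    for card in hand:
        rank_bits |= 1 << RANKS.index(card[0])
        suit_bits |= 1 << SUITS.index(card[1])
    return rank_bits, suit_bits


def distinct_ranks(hand):
    """Number of different ranks; fewer than len(hand) means a pair or better."""
    rank_bits, _ = masks(hand)
    return bin(rank_bits).count("1")


def is_flush(hand):
    """Five cards whose OR-ed suit mask has exactly one bit set."""
    _, suit_bits = masks(hand)
    return len(hand) == 5 and suit_bits & (suit_bits - 1) == 0


def is_straight(hand):
    """Five distinct ranks forming five consecutive bits (or the wheel)."""
    rank_bits, _ = masks(hand)
    if len(hand) != 5 or distinct_ranks(hand) != 5:
        return False
    lowest = rank_bits & -rank_bits  # isolate the lowest set bit
    return rank_bits // lowest == 0b11111 or rank_bits == WHEEL


if __name__ == "__main__":
    print(is_flush(["2h", "5h", "9h", "Jh", "Ah"]))
    print(is_straight(["5c", "6d", "7h", "8s", "9c"]))
    print(distinct_ranks(["Qh", "Qs", "2c"]))
```

Each check is a handful of integer operations, which is what makes evaluating millions of candidate rows plausible.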
At first glance, there would seem to be 15! (1,307,674,368,000) possible positions for all the cards which are dealt, but there are other symmetries and restrictions that reduce the meaningful combinations and limit the space that must be explored.
One important constraint is that the bottom row must have a higher score than the middle row and that the middle row must have a higher score than the top row.
There are 3003 sets of bottom cards: COMBINATIONS(15 cards, 5 at a time) = (15!)/(5!(15-5)!) = 3003. For each set of possible bottom cards, there are COMBINATIONS(10 cards, 5 at a time) = (10!)/(5!(10-5)!) = 252 sets of middle cards. The top row has COMBINATIONS(5 cards, 3 at a time) = (5!)/(3!(5-3)!) = 10 possibilities. With no further optimization, a brute force approach would require evaluating 3003*252*10 = 7,567,560 positions. I suspect that this can be evaluated within an acceptable response time.
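The counts above are easy to double-check, and the same nesting gives the skeleton of the brute-force loop. This is only a sketch: count_positions and positions are names I made up, the cards are assumed to be distinct, and no hand evaluation or pruning is included.

```python
from itertools import combinations
from math import comb


def count_positions(n_cards=15, bottom=5, middle=5, top=3):
    """Number of (bottom, middle, top) splits, ignoring order within a row."""
    after_bottom = n_cards - bottom
    after_middle = after_bottom - middle
    return (comb(n_cards, bottom)
            * comb(after_bottom, middle)
            * comb(after_middle, top))


def positions(cards, bottom=5, middle=5, top=3):
    """Yield every (bottom, middle, top) split; leftover cards are the discards."""
    cards = list(cards)
    for bottom_row in combinations(cards, bottom):
        rest = [c for c in cards if c not in bottom_row]
        for middle_row in combinations(rest, middle):
            # A constraint check on bottom vs. middle would go here,
            # skipping the inner loop entirely when it fails.
            rest_after_middle = [c for c in rest if c not in middle_row]
            for top_row in combinations(rest_after_middle, top):
                yield bottom_row, middle_row, top_row


if __name__ == "__main__":
    print(comb(15, 5), comb(10, 5), comb(5, 3))
    print(count_positions())
```

Iterating positions() over a real 15-card deal in pure Python will be much slower than a compiled implementation, so treat it as a reference for the enumeration order rather than a performance claim.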
A further optimization uses the constraint that each row must be worth less than the row below. If the middle row is worth more than the bottom row, the top row can be ignored by pruning the tree at that point, which removes a factor of 10 for those cases.
Also, since the bottom row must be worth more than the middle and top rows, there may be some minimum score the bottom row must achieve before it is worth trying middle rows. Rejecting a bottom row prunes 2520 cases from the tree.
I understand that there is a way to use simulated annealing for estimating solutions for discrete problems. My use of simulated annealing has been limited to continuous problems with edge constraints. I don't have a good intuition for how to apply SA to discrete problems. Many discrete problems lend themselves to an exhaustive search, provided the search space can be trimmed by exploiting symmetries and constraints in the particular problem.
I'd love to know the solution you choose and your results.

most suited search algorithm?

I'm now facing a problem and I'm not sure what the right solution is. I'll try to explain it, and I hope someone has some good solutions for me:
I have two big data arrays. One that I'm browsing, with something between 50^3 and 150^3 data samples (usually between 50 and 100, rare worst case scenario 150).
For every sample, I want to make a query on another structure that is around the same size (so huge number of total combinations, I can't explore them all).
The query can't be predicted exactly but usually, it is something like :
structure has fields A B C D E F G (EDIT : in total, it's something like 10 to 20 int fields).
query is something like :
10 < A < 20 and B > 100 and D > 200.
Yes, it's really close to SQL.
I thought about putting this in a database, but it would be a standalone database, and I can work in RAM to make it even faster (speed is an essential criterion).
I thought about trying something with GPGPU, but it seems to be a terrible idea: although the search can be parallelized, searching for an unpredictable number of results doesn't seem to be a good application (if someone can tell me whether my understanding is right, it would help me confirm that I should give up on this solution).
EDIT : the number of results is unpredictable because of the nature of the query, but it is quite low, since the purpose is to find a small number of well-suited combinations.
Then, since I could use a DB, why not build a B-tree in RAM? It seems close to the solution, but is it? If it is, how should I build my indexes? Can I really do multidimensional indexes, since the search will always be multidimensional? Probably a UB-tree or R-tree could do the job (but my second data sample could contain duplicates, so doesn't that make the R-tree inapplicable?).
The thing is, I'm not sure I properly understand all of those right now, so if one of you knows trees (and GPGPU, and even solutions I didn't think of), perhaps you could let me know which solution I should explore, learn, and implement?
GPGPU is not a suitable choice because you are limited by the cards' capacity, and since you are not telling us the data size of these samples I am assuming that a Titan X tier card will not suffice. If you could go really wild (Tesla or FirePro), then it would actually be worth it, since you mentioned that speed really matters. But I am going to speculate that these things are out of your budget, and considering that you would have to learn CUDA or OpenCL to make something that will generally be a pain to port here and there, my take is "No".
You mentioned that you have an unpredictable number of results, and this is a bad thing. You should develop a formula that will "somewhat" estimate the amount of space which will be needed; otherwise it will be disappointing to have your program work on something for quite some time only to get a capacity error/crash. On the other hand, if the RAM capacity is not sufficient, you could work "database style", fetching data from storage when needed (and this is quite bothersome to implement due to scheduling implementations etc).
If you do have the time to go bespoke, here is a helpful link. Remember, you are going to stumble a lot, but when you make it you will have learnt a ton of stuff:
https://www.quora.com/What-are-some-fast-similarity-search-algorithms-and-data-structures-for-high-dimensional-vectors
In my opinion, an in-memory database is the easiest and at the same time most reliable thing to do without compromising on speed. Which one to implement is up to you. I think MemSQL is a good one.
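To give an idea of how little code the in-memory database route needs, here is a sketch using SQLite's in-memory mode from the Python standard library. SQLite is swapped in here only because it ships with Python; it is not MemSQL, and no speed comparison is implied. The table layout (four int fields named a to d) and the one-index-per-queried-field choice are assumptions for the example.

```python
import sqlite3


def build_store(rows):
    """Load (a, b, c, d) integer tuples into an in-memory table with indexes."""
    conn = sqlite3.connect(":memory:")  # lives entirely in RAM
    conn.execute("CREATE TABLE samples (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", rows)
    # One B-tree index per field used in range conditions; duplicates are fine.
    conn.execute("CREATE INDEX idx_a ON samples (a)")
    conn.execute("CREATE INDEX idx_b ON samples (b)")
    conn.execute("CREATE INDEX idx_d ON samples (d)")
    conn.commit()
    return conn


def query(conn, a_low, a_high, b_min, d_min):
    """Rows with a_low < a < a_high and b > b_min and d > d_min."""
    cursor = conn.execute(
        "SELECT a, b, c, d FROM samples "
        "WHERE a > ? AND a < ? AND b > ? AND d > ? "
        "ORDER BY a, b, c, d",
        (a_low, a_high, b_min, d_min),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    store = build_store([(15, 150, 0, 250), (5, 150, 0, 250), (19, 101, 7, 201)])
    # The example query from the question: 10 < A < 20 and B > 100 and D > 200.
    print(query(store, 10, 20, 100, 200))
```

Whether single-column indexes are enough, or a composite or multidimensional index is needed, depends on the real query mix and would have to be measured on the actual data.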
