hash on large set of values, retaining information about sequence and approximate size: via Excel VBA - excel

In Excel, I have two large columns of values that are usually identical in size and sequence. I want a hash for each column to check that the columns are in fact identical (with pretty good probability).
I have an MD5 hash algorithm which gives a hash for a single string, but I want something for a large set of values (about 20k). This would be slow.
I can use a simple function like this:
hash = mean + stdev + skewness
In VBA, this looks like:
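Function hash(x As Range)
    ' Recalculate whenever the worksheet recalculates
    Application.Volatile
    ' Order-insensitive summary: standard deviation + skewness + mean
    hash = Application.WorksheetFunction.StDev(x) + Application.WorksheetFunction.Skew(x) + Application.WorksheetFunction.Average(x)
End Function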
and this gives me some confidence that the columns are the same in terms of magnitudes; but sometimes the values are identical but not in the correct order, and my hash cannot detect this. I need my hash to be able to detect wrong ordering.
I do not require 'anonymizing' or 'randomizing' of the data; there is no issue of privacy etc. In fact, a kind of 'proportional' hash that returns a small value for small errors and a large value for large errors would be extremely useful. Given that some rounding errors may result in small differences that I do not care about, the MD5 algorithm sometimes gives me false warnings.
Unfortunately the data is within Excel (because it is the result of previous Excel manipulations), and so a VBA function that would keep me in Excel, and allow me to proceed once the columns have been verified, would be best. So I'd like a function of the form
Of course, I could just compare the Excel columns by making another column and performing a large boolean AND (A1=B1, A2=B2, etc.). But this would be tedious and inefficient. I actually have thousands of these columns to compare in order to find bugs.
Any ideas?

The easiest way to compare two columns for near-equality is to use the worksheet function SUMXMY2(). This computes the squared Euclidean distance between two ranges, thought of as vectors in a higher-dimensional space. For example, to check if A1:A20000 is very close to B1:B20000, use the comparison
SUMXMY2(A1:A20000, B1:B20000) < tol
where tol is an error threshold which determines how much round-off error you are willing to tolerate.
Your original idea of using hashing could be useful in some circumstances. To make it tolerant of round-off error, look into the theory of locality-sensitive hashing rather than cryptographic hashes such as MD5. Any such algorithm, if implemented in VBA, would be somewhat slow, but depending on what you are trying to do it could be useful.
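To make the behaviour of the SUMXMY2 comparison concrete, here is a minimal sketch in Python rather than VBA. The function names, the sample columns and the tolerance value are assumptions chosen for illustration; only the sum-of-squared-differences idea comes from the answer above.

```python
def sumxmy2(xs, ys):
    """Sum of squared differences, mirroring what SUMXMY2(xs, ys) computes."""
    if len(xs) != len(ys):
        raise ValueError("both columns must have the same length")
    return sum((x - y) ** 2 for x, y in zip(xs, ys))


def columns_match(xs, ys, tol=1e-9):
    """True when the squared distance is below the (assumed) tolerance."""
    return sumxmy2(xs, ys) < tol


# Hypothetical sample columns
a = [1.0, 2.0, 3.0]
b = [1.0, 2.0, 3.0 + 1e-8]   # differs only by round-off
c = [3.0, 2.0, 1.0]          # same values, wrong order

print(columns_match(a, b))   # round-off stays under the tolerance
print(sumxmy2(a, c))         # reordering gives a large distance
```

Because the differences are taken position by position, the result is small for small errors, large for large errors, and sensitive to ordering, which is what the question asks for.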

Related

How to sort rows in Excel without having repeated data together

I have a table of data with many data repeating.
I have to sort the rows by random, however, without having identical names next to each other, like shown here:
How can I do that in Excel?
Perfect case for a recursive LAMBDA.
In Name Manager, define RandomSort as
=LAMBDA(ζ,
    LET(
        ξ, SORTBY(ζ, RANDARRAY(ROWS(ζ))),
        λ, TAKE(ξ, , 1),
        κ, SUMPRODUCT(N(DROP(λ, -1) = DROP(λ, 1))),
        IF(κ = 0, ξ, RandomSort(ζ))
    )
)
then enter
=RandomSort(A2:B8)
within the worksheet somewhere. Replace A2:B8 - which should be your data excluding the headers - as required.
If no solution is possible then you will receive a #NUM! error. I didn't get round to adding a clause to determine whether a certain combination of names has a solution or not.
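The recursive LAMBDA is a retry loop: shuffle, count adjacent rows whose first column matches, and try again if there are any. A rough Python sketch of the same idea follows; the function name, the `max_attempts` cap and the sample rows are assumptions, and the cap stands in for the point at which Excel gives up with #NUM!.

```python
import random


def random_sort_no_adjacent(rows, max_attempts=10_000, rng=random):
    """Shuffle rows until no two neighbours share the same first field."""
    for _ in range(max_attempts):
        shuffled = list(rows)
        rng.shuffle(shuffled)
        # Equivalent of kappa = 0: no adjacent pair has the same name
        if all(a[0] != b[0] for a, b in zip(shuffled, shuffled[1:])):
            return shuffled
    raise ValueError("no valid order found within max_attempts")


# Hypothetical (name, age) rows
rows = [("Ann", 30), ("Ann", 25), ("Bob", 41), ("Cy", 19), ("Bob", 33)]
print(random_sort_no_adjacent(rows))
```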
This is just an attempt, because the question might need clarification or more sample data to understand the actual scenario. The main idea is to generate a random list from the input, then distribute it evenly by names. This is meant to prevent consecutive repeats of a name. It is not the only possible way of sorting (this problem may have multiple valid combinations), but it is a valid one. The solution is volatile (every time Excel recalculates, a new output is generated) because RANDARRAY is a volatile function.
In cell D2, you can use the following formula:
=LET(rng, A2:B8, m, ROWS(rng), seq, SEQUENCE(m),
idx, SORTBY(seq, RANDARRAY(m,,1,m, TRUE)), rRng, INDEX(rng, idx,{1,2}),
names, INDEX(rRng,,1), nCnts, MAP(seq, LAMBDA(s, ROWS(FILTER(names,
(names=INDEX(names,s)) * (seq<=s))))), SORTBY(rRng, nCnts))
Here is the output:
Update
Looking at @JosWoolley's approach, the generation of the random sorting can be simplified, so that the resulting formula could be:
=LET(rng, A2:B8, m, ROWS(rng), seq, SEQUENCE(m), rRng,SORTBY(rng, RANDARRAY(m)),
names, TAKE(rRng,,1), nCnts, MAP(seq, LAMBDA(s, ROWS(FILTER(names,
(names=INDEX(names,s)) * (seq<=s))))), SORTBY(rRng, nCnts))
Explanation
The LET function is used for easy reading and composition. The name idx represents a random sequence of the input index positions. The name rRng represents the input rng, but sorted at random. This sorting doesn't ensure consecutive names are distinct.
In order to keep consecutive names from repeating, we enumerate (nCnts) the occurrences of each repeated name. We use MAP for that. This is similar to the idea provided by @cybernetic.nomad in the comment section, but adapted for an array version (we cannot use COUNTIF because it requires a range). Finally, we use SORTBY with the map result (nCnts) as its by_array argument, so that names are distributed evenly and consecutive names are kept apart. Every time Excel recalculates you will get an output with the names distributed evenly in a different way.
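The numbering-and-sorting step can be sketched in Python as follows. This is an illustration under assumptions: the function names are made up, and Python's stable sort is assumed to stand in for how SORTBY orders ties.

```python
import random
from collections import Counter


def number_and_sort(shuffled):
    """Tag each row with the running occurrence count of its name (nCnts),
    then sort by that count so all first occurrences come before second ones."""
    seen = Counter()
    occurrence = []
    for name, *_ in shuffled:
        seen[name] += 1
        occurrence.append(seen[name])
    order = sorted(range(len(shuffled)), key=occurrence.__getitem__)  # stable
    return [shuffled[i] for i in order]


def spread_by_occurrence(rows, rng=random):
    """Random shuffle followed by the occurrence-count sort."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return number_and_sort(shuffled)


print(number_and_sort([("A", 1), ("A", 2), ("B", 3)]))
print(number_and_sort([("B", 1), ("A", 2), ("A", 3)]))
```

The second call is worth a look: with a stable sort, a repeat can still appear where one round of occurrences ends and the next begins (B, A, A stays B, A, A), so it may be worth checking the output for the actual data.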
Not sure if it's worth posting this, but I might as well share the results of my research such as it is. The problem is similar to that of re-arranging the characters in a string so that no two identical characters are adjacent. The method is just to insert whichever one of the remaining characters (names) has the highest frequency at this point and is not the same as the previous character, then reduce its frequency once it has been used. It's fairly easy to implement this in Excel, even in Excel 2019. So if the initial frequencies are in D2:D8 for convenience using COUNTIF:
=COUNTIF(A$2:A$8,A2)
You can use this formula in (say) F2 and pull it down:
=INDEX(A$2:A$8,MATCH(MAX((D$2:D$8-COUNTIF(F$1:F1,A$2:A$8))*(A$2:A$8<>F1)),(D$2:D$8-COUNTIF(F$1:F1,A$2:A$8))*(A$2:A$8<>F1),0))
and similarly in G2 to get the ages:
=INDEX(B$2:B$8,MATCH(MAX((D$2:D$8-COUNTIF(F$1:F1,A$2:A$8))*(A$2:A$8<>F1)),(D$2:D$8-COUNTIF(F$1:F1,A$2:A$8))*(A$2:A$8<>F1),0))
I'm fairly sure this will always produce a correct result if one is possible.
HOWEVER there is no randomness built in to this method. You can see if I extend it to more data that in the first several rows the most common name simply alternates with the other two names:
Having said that, this is a bit of a worst case scenario (a lot of duplication) and it may not look too bad with real data, so it may be worth considering this approach along with the other two methods.
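For readers who find the nested INDEX/MATCH hard to follow, the greedy rule can be written out as a short Python sketch. It handles names only (not the ages column), the function name is hypothetical, and ties are broken in favour of the name seen first, which is an assumption about how MATCH picks among equal maxima.

```python
from collections import Counter


def greedy_arrange(names):
    """Repeatedly take the name with the highest remaining count
    that differs from the previously placed name."""
    remaining = Counter(names)
    distinct = list(dict.fromkeys(names))  # distinct names, first-seen order
    result = []
    prev = None
    for _ in range(len(names)):
        candidates = [n for n in distinct if n != prev and remaining[n] > 0]
        if not candidates:
            raise ValueError("no arrangement without adjacent repeats")
        best = max(candidates, key=lambda n: remaining[n])  # first max wins ties
        remaining[best] -= 1
        result.append(best)
        prev = best
    return result


print(greedy_arrange(list("AAABC")))
```

As noted above, there is no randomness here: the same input always yields the same order.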

Calculate the minimum value of each column in a matrix in EXCEL

Alright this should be a simple one.
I apologize in case it has been already solved, but I can only find posts related to solving this issue with programming languages and not specifically to EXCEL.
Furthermore, I could find posts that address a sub-problem of my question (e.g. regarding limitation of certain EXCEL functions) and should solve/invalidate my request but maybe, just maybe, there is a workaround.
Problem statement:
I want to calculate the minimum value for each column in an EXCEL matrix. Simply put, I want to input a 2D array (mxn matrix) into a function and output an array with dimension 1xn, where each item is the minimum value of the corresponding column.
However, I want to solve this with specific constraints:
Avoid using VBA and other non-function scripting: that I could devise myself;
All in one function: what I want to achieve here is to have one and one function only, not split the problem into multiple passages (such as for example copypasting a MIN() function below each column, that wouldn't do it);
The result should be a transposable array (which is already ok, I assume);
Where I am stranded with my solution so far:
The main issue here is that any function I am trying to use takes the entire matrix as a single array input and would calculate the MIN() of the entire matrix, not each column. My current (not working) function for an exemplary 4x4 matrix in range A1:D4 would be as below (the part in bold is where it is clearly not working):
=MIN(INDEX(A1:D4,SEQUENCE(4,4,1,1)))
which of course does not work, probably because INDEX() does not "understand" SEQUENCE() as an array of items to take into account. Another approach, also not working, is to input a series of ranges (A1:A4;B1:B4;C1:C4;D1:D4) so that INDEX() "understands" the ranges as single columns, but I honestly do not know how to formulate that. I could use INDIRECT() in some way to reference the array of ranges, but I do not know how and could not find a way by searching online.
The fundamental question is: can a function, which works with single arrays, also work with multiple arrays? Basically, I do not know how to tell an EXCEL array formula that each batch of data I am inputting is a single array and must be evaluated separately (this is very easily solved with for() loops, I know).
Many thanks for any suggestion and any workaround; any function and solution works as long as it fits the constraints defined above (maybe a LAMBDA() function? I don't know).
This is of course a simplification of a far more complex problem (I am trying to calculate the annual mean temperature evolution for a specific location by finding the value, for each year from 1950 to 2021, that is associated with the lat/lon coordinates nearest to those of the input location, given a netCDF-imported grid of time-arrayed data; the MIN() function is used to select the nearest location, which is then used, via INDEX(), to find the temperature data). I need to do this in one hit (meaning just pasting the function, which evaluates a matrix of data that is referenced by a fixed range), so that I can use it modularly for other data sets. I already have a working solution, which is "elegant"* enough, but not as "elegant"* as the one I could develop by solving this issue.
*where "elegant"= it saves me one click every time for 1000+ datasets when applying the function.
If I understand your problem correctly, then this should solve it:
=BYCOL(A1:D4,LAMBDA(d,MIN(d)))
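BYCOL hands each column to the LAMBDA in turn and collects the results into a single row. For comparison, the same per-column evaluation looks like this in Python; the grid values are made up for the example.

```python
def column_minima(matrix):
    """Return the minimum of each column of a row-major 2D list."""
    return [min(column) for column in zip(*matrix)]


# Hypothetical 4x4 grid standing in for A1:D4
grid = [
    [4, 9, 2, 7],
    [3, 5, 8, 1],
    [6, 0, 5, 5],
    [7, 7, 7, 2],
]
print(column_minima(grid))  # [3, 0, 2, 1]
```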

Create Random number sequence without duplicates in excel for almost million rows or records

I want to create a random sequence of numbers in 11-digit format, running from 10000000000 to 999999999999, where each value is unique. I would like to populate almost 20-50 million records in Excel without having to keep dragging all the way down to the bottom of the column by clicking the + button.
I tried using RANDBETWEEN, but there seem to be duplicates and I have to keep dragging, which is time-consuming. Is there a better way to accomplish this?
=RANDBETWEEN(10000000000,999999999999)
For that many unique numbers I suggest using an encryption, where the output is guaranteed unique for unique inputs.
Simply encrypt the numbers 0, 1, 2, ... for different unique inputs. You will need to use the same encryption key and other inputs (IV, nonce etc.) to guarantee unique outputs.
You will need to do some processing on the outputs to get them into the required range. Have a look at Format Preserving Encryption for some help with this.
As @BigBen pointed out, Excel is probably the wrong tool for this.
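To illustrate why encrypting a counter gives unique outputs, here is a toy Python sketch: a small Feistel network (a keyed, invertible mapping) combined with cycle-walking to stay inside the target range. This is not a vetted format-preserving encryption scheme and should not be used where security matters; the function names, the SHA-256 round function, the round count and the key are all assumptions made for the example.

```python
import hashlib


def _round(value, key, rnd, half_bits):
    """Keyed round function: hash the half-block, keep half_bits bits."""
    data = key + bytes([rnd]) + value.to_bytes(8, "big")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << half_bits) - 1)


def feistel_permute(x, key, half_bits, rounds=4):
    """Balanced Feistel network: a bijection on [0, 2**(2*half_bits))."""
    mask = (1 << half_bits) - 1
    left, right = x >> half_bits, x & mask
    for rnd in range(rounds):
        left, right = right, left ^ _round(right, key, rnd, half_bits)
    return (left << half_bits) | right


def unique_code(index, key, low, high):
    """Map index 0, 1, 2, ... to a distinct number in [low, high]."""
    size = high - low + 1
    if not 0 <= index < size:
        raise ValueError("index out of range")
    half_bits = max(1, ((size - 1).bit_length() + 1) // 2)
    y = feistel_permute(index, key, half_bits)
    while y >= size:  # cycle-walking: re-encrypt until the value fits
        y = feistel_permute(y, key, half_bits)
    return low + y


LOW, HIGH = 10000000000, 999999999999  # range taken from the question
codes = [unique_code(i, b"example-key", LOW, HIGH) for i in range(5)]
print(codes)
```

Each index can be processed independently, so the numbers can be generated in batches without keeping earlier outputs in memory to check for duplicates.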

Is there Infinity in Spreadsheets?

I am wondering if there is any way to represent infinity (or a sufficiently high number) in MS Excel.
I am particularly looking for something like Double.POSITIVE_INFINITY or Double.MAX_VALUE in Java.
I like to use 1e99 as it gives the largest number with the fewest keystrokes but I believe the absolute maximum is actually 9.99999E+307. At that stage of the number spectrum I don't think there is much difference as far as Excel is concerned.
I think it's worth adding that Infinity, as well as other special values, can be returned from a VBA function (How do you get VB6 to initialize doubles with +infinity, -infinity and NaN?):
Function Infinity(Optional Recalc) As Double
    ' Ignore the division-by-zero error raised on the next line
    On Error Resume Next
    Infinity = 1/0
End Function
When entered as a cell formula a large number is shown (2^1024). You can set a conditional format to show "+Infinity" as a number format with a formula condition:
=AND(ISNUMBER(A1),A1>2^1023*(2-2^-52))
A dummy argument containing a dynamic reference can be inserted so that values are recalculated when the workbook is opened, for example:
=Infinity(IF(,) IF(,))
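The threshold in the conditional-format condition, 2^1023*(2-2^-52), is the largest finite double, so anything above it can only be +infinity. A short Python check of that arithmetic follows (Python is used only because the comparison is easy to run there; the helper name is hypothetical):

```python
import math
import sys

# Same expression as in the conditional-format formula
MAX_DOUBLE = 2.0 ** 1023 * (2 - 2.0 ** -52)


def is_positive_infinity(value):
    """Mirror of =AND(ISNUMBER(A1), A1 > 2^1023*(2-2^-52)) for floats."""
    return isinstance(value, float) and value > MAX_DOUBLE


print(MAX_DOUBLE == sys.float_info.max)
print(is_positive_infinity(math.inf), is_positive_infinity(1e99))
```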
With LibreOffice 6 I use 1.79769313486231E+308, which seems to be the largest number it allows me to enter, but I miss having an exact representation of +/- infinity, also because I suspect the number above is implementation-specific...
This is another point that makes me think that spreadsheets are great for visualising, editing and simple computations on tabular data, but for doing more complex operations/modelling a real programming language is a must...

Fast repeated row counting in vast data - what format?

My Node.js app needs to index several gigabytes of timestamped CSV data, in such a way that it can quickly get the row count for any combination of values, either for each minute in a day (1440 queries) or for each hour in a couple of months (also 1440). Let's say in half a second.
The column values will not be read, only the row counts per interval for a given permutation. Reducing time to whole minutes is OK. There are rather few possible values per column, between 2 and 10, and some depend on other columns. It's fine to do preprocessing and store the counts in whatever format suitable for this single task - but what format would that be?
Storing actual values is probably a bad idea, with millions of rows and little variation.
It might be feasible to generate a short code for each combination and match with regex, but since these codes would have to be duplicated each minute, I'm not sure it's a good approach.
Or it could use an embedded database like SQLite, NeDB or TingoDB, but I am not entirely convinced, since they don't have native enum-like types and might or might not be made for this kind of counting. But maybe it would work just fine?
This must be a common problem with an idiomatic solution, but I haven't figured out what it might be called. Knowing what to call this and how to think about it would be very helpful!
Will answer with my own findings for now, but I'm still interested to know more theory about this problem.
NeDB was not a good solution here as it saved my values as normal JSON under the hood, repeating key names for each row and adding unique IDs. It wasted lots of space and would surely have been too slow, even if just because of disk I/O.
SQLite might be better at compressing and indexing data, but I have yet to try it. Will update with my results if I do.
Instead I went with the other approach I mentioned: assign a unique letter to each column value we come across and get a short string representing a permutation. Then for each minute, add these strings as keys iff they occur, with the number of occurrences as values. We can later use our dictionary to create a regex that matches any set of combinations, and run it over this small index very quickly.
This was easy enough to implement, but would of course have been trickier if I had had more possible column values than the about 70 I found.
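A rough sketch of that index, written in Python for brevity rather than Node.js: every name below, the 62-character alphabet and the sample rows are assumptions, and the real implementation may differ in how it stores and scans the index.

```python
import re
import string
from collections import Counter, defaultdict

# 62 single-character codes; an assumed limit that would need extending
ALPHABET = string.ascii_letters + string.digits


def build_index(rows):
    """rows: iterable of (minute, values). Returns (letters, index)."""
    letters = {}                   # (column, value) -> one-character code
    index = defaultdict(Counter)   # minute -> {permutation string: row count}
    for minute, values in rows:
        code = ""
        for column, value in enumerate(values):
            key = (column, value)
            if key not in letters:
                letters[key] = ALPHABET[len(letters)]
            code += letters[key]
        index[minute][code] += 1
    return letters, index


def make_pattern(letters, wanted):
    """wanted: per column, None (any value) or a set of accepted values."""
    parts = []
    for column, values in enumerate(wanted):
        if values is None:
            parts.append(".")
            continue
        chars = "".join(sorted(letters[(column, v)]
                               for v in values if (column, v) in letters))
        if not chars:
            return None  # none of the requested values was ever seen
        parts.append("[" + chars + "]")
    return re.compile("".join(parts))


def count_rows(index, pattern, minute):
    """Sum the counts of all permutation strings matching the pattern."""
    if pattern is None:
        return 0
    return sum(n for code, n in index.get(minute, {}).items()
               if pattern.fullmatch(code))


# Hypothetical rows: (minute of day, (method, status))
rows = [(0, ("GET", "ok")), (0, ("GET", "ok")),
        (0, ("POST", "fail")), (1, ("GET", "fail"))]
letters, index = build_index(rows)
print(count_rows(index, make_pattern(letters, [{"GET"}, None]), 0))
```

Since each minute stores only the permutations that actually occurred, the index stays small even when the raw data runs to millions of rows.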
