Calculating the highest 2-side average match - excel

"First part of the question is dedicated towards explaining the concept better, so we know, what we're calculating with. Feel free to
skip below to the latter parts, if you find it unnecessary"
1. Basic overview of the question:
Hello, I've got an Excel application, something akin to a dating site. You can open various user profiles and even scan through the data to find potential matches, based on hobbies, cities and other criteria.
How it's calculated is not relevant to the question, but the result of a "Find Match" calculation looks something like this: a sorted list of users, ordered by how well they fit (last column).
Relevant to the question are mainly:
the first column (ID) - the ID of the user
the last column (Zhoda, "match") - the Match% of each listed user against the one currently selected
2. What I need to do - how it's currently done
I need to find the highest match on average out of all users. If I were to write this algorithmically:
1. Loop through all users
2. For each user in our database calculate the potential matches
3. Store the score of selected user ID, against all the found user IDs
4. Once it's all calculated, pit all users against each other and find the highest match on average
Obviously that sounds pretty complicated / vague, so here's a
simplified example. Let's say I have completed the first 3 steps and
have gotten the following result:
Here, the desired result would be:
User1 <- 46% -> User2
as they have the highest combined percentage average:
User1 vs User2: 30%
User2 vs User1: 62%
User1 <- (30+62)/2 -> User2
And no other possible combination of users has higher match% average
3. The purpose behind the question:
Now obviously you may ask: if I already know how to calculate it, why ask the question in the first place? Well, the reason is that comparing everything against everything is extremely inefficient.
As soon as there are, say, 100 users instead of 3 in my database, I would have to do 100*100 match% calculations alone, and afterwards still check the average Match% of each pair of users.
Is there perhaps some better approach, in which I could either:
minimize the data I have to calculate with,
use some sorting algorithm where I could skip certain calculations in order to be quicker,
or take an overall better approach towards calculating the highest average match%?
So to recapitulate:
I've got a database of users.
Each individual user has a certain amount of Match% against every other user
I need to find the two users who, against one another (in both directions), have the highest average Match% out of all possible combinations.
If you feel like you need any additional info, please let me know.
I'll try to keep the question updated as much as possible.

As you've presented the problem -- no, you cannot speed this up significantly. Since you've presented match% as an arbitrary function, constrained only by implied range, there are no mathematical properties you can harness to reduce the worst-case search scenario.
Under the given circumstances, the best you can do is to leverage the range. First, don't bother with the "average": since these are strictly two-way matches, dividing by 2 is simply a waste of time; keep the total.
Start by picking a pair and doing the two-way match. Once you find a total of more than 100, store that value and use it to prune any sub-standard searches. For instance, if your best match so far totals 120, and you find a pair where match(A, B) < 20, then you don't bother computing match(B, A).
In between, you can maintain a sorted list (O(n log n)) of first matches; don't do the second match unless you have reason to believe that this one might exceed your best match.
The rest of your optimization consists of gathering statistics about your matching, so that you can balance when to do first-only against two-way matches. For instance, you might defer the second match for any first match that is below the 70th percentile of those already deferred, in the hope of finding a far better match that would entirely eliminate this one.
If you gather statistics on the distribution of your match function, then you can tune this back-and-forth process better.
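A minimal VBA sketch of that prune loop, assuming a hypothetical MatchPct(i, j) function that returns the one-way match (0-100) of user i against user j, and an assumed user count n:

Sub FindBestPair()
    Dim n As Long, i As Long, j As Long
    Dim firstLeg As Double, total As Double
    Dim best As Double, bestA As Long, bestB As Long

    n = 100              ' number of users (assumption)
    best = 0             ' best two-way total found so far

    For i = 1 To n - 1
        For j = i + 1 To n
            firstLeg = MatchPct(i, j)   ' MatchPct is hypothetical
            ' Prune: even a perfect 100% return leg cannot beat the best total
            If firstLeg + 100 > best Then
                total = firstLeg + MatchPct(j, i)
                If total > best Then
                    best = total
                    bestA = i
                    bestB = j
                End If
            End If
        Next j
    Next i

    Debug.Print "Best pair: " & bestA & " & " & bestB & " (average " & best / 2 & "%)"
End Sub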
If you can derive mathematical properties about your match function, then there may be ways to leverage those properties for greater efficiency. However, since it's already short of being a formal topological "distance" metric d (see below), I don't hold out much hope for that.
Basic metric properties:
d(A, B) exists for all pairs (A, B)
d(A, B) = d(B, A)
d(A, A) = 0 // does not apply to a bipartite graph
d obeys the triangle inequality -- which doesn't apply directly, but has some indirect consequences for a bipartite graph.

Related

Game Results in Excel

I would much appreciate it if someone could help me with the needed formulas for this case.
I have multiple matches for which I want to determine the winners based on their scores. I also want to compute a couple of stats based on the results.
A sample of how the game result should be entered
My requests are:
1- Return the winner team name based on the original time result; if a tie, then the extra time result; if a tie, then the penalties result. I also need the winner cell to be empty if no game result is entered.
2- If the game ends in original time, OTC counter increases by 1.
3- If the game ends in extra time, ETC counter increases by 1.
4- If the game ends in penalties, PC counter increases by 1.
I am guessing the counters would be done using the same method but you are the expert here.
Thank you so much for your time and effort.
(This information is too big to fit in a comment, hence I'm putting it as an answer.)
I don't think you'll get answers on this site, as you have not put in any effort yourself. But I have the impression that this is because you don't know where to start, so let me give you some starting advice.
The functions you'll need to perform this task are mostly Max(), Sum(), IF(), CountIF() and maybe SumIF() (or CountIFS() and SumIFS() in case of multiple criteria).
As for finding the winner, you might use the Max() function to find the best result, and a Lookup() function to know where that result occurs.
It might be helpful to add a helper column containing a value (like 1) for all winning teams. By summing all those ones, you can fill in the information in your other columns.
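For instance, a sketch under assumed references (team names in B2:B3, the original time score in C2:C3, and a helper column H recording "OT", "ET" or "P" for how each game was decided), with INDEX/MATCH standing in for the Lookup() idea, could look like this for the original-time winner:

=IF(OR(C2="",C3=""),"",IF(C2=C3,"",INDEX(B2:B3,MATCH(MAX(C2:C3),C2:C3,0))))

and the OTC counter could then simply be =COUNTIF(H:H,"OT").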
Now you have a starting point. Please try this out and if you have any specific questions, feel free to ask.
Thanks to Dominique, this is how far I've got with deciding on the match winner.
I used a combination of IF(), MAX() and LOOKUP functions. I am now determining results based on 2 cases: original time and penalties.
This is how the match appears
And this is what my formula to determine the winner looks like
=IF(C12=C13,IF(ISBLANK(D13),"",LOOKUP((MAX(D12,D13)),D12:D13,B12:B13)),IF(ISBLANK(C13),"",LOOKUP((MAX(C12,C13)),C12:C13,B12:B13)))
My issue now is that I want to account for human error when entering the results. With this formula, it returns Team B as the winner if the penalties result is a draw, which cannot happen. I need it to show an error or return no output if the results entered in the penalties score are equal.
Thank you for your support.

Spotfire DenseRank by category, do I use OVER?

I'm trying to rank some data in Spotfire, and I'm having a bit of trouble writing a formula to calculate it. Here's a breakdown of what I am working with.
Group: the test group
SNP: what SNP I am looking at
Count: how many counts I get for the specific SNP
What I'd like to do is rank the average # of counts that are present for each SNP, within the group. Thus, I could then see, within a group, which SNP ranks #1, #2, etc.
Thanks!
TL;DR Disclaimer: You can do this, though if you are changing your cross table frequently, it may become a giant hassle. Make sure to double-check that the logic is what you'd expect after any modification. Proceed with caution.
The basis of the Custom Expression you seem to be looking for is as follows:
Max(DenseRank(Count() OVER (Intersect([Group],[SNP])),"desc",[Group]))
This gives the total count of rows instead of the average; I was uncertain if "Count" was supposed to be a column or not. If you really do want to turn it into an average, make sure to adjust accordingly.
If all you have is the Group and the SNP nested on the left, you're done and good to go.
First issue, when you want to filter it down, it gives you the dense rank of only those in the filtered set. In some cases this is good, and what you're looking for; in others, it isn't. If you want it to hold fast to its value, regardless of filtering, you can use the same logic, but throw it in a Calculated column, instead of in the custom expression. Then, in your CrossTable Aggregation, get the max of the Calculated Column value.
Calculated Column:
DenseRank(Count() OVER (Intersect([Group],[SNP])),"desc",[Group])
Second Issue: You want to pivot by something other than Group and SNP. Perhaps, for example, by date? If you throw the Date across the top, it's going to show the same numbers for every month -- the overall numbers. This is not particularly helpful.
To a certain extent, Spotfire's Custom Expressions can handle this modification. As long as you stick to a single extra column across the top, you could use the following:
Max(DenseRank(Count() OVER (Intersect([${Axis.Columns.ShortDisplayName}],[Group],[SNP])),"desc",[Group],[${Axis.Columns.ShortDisplayName}]))
That would automatically pull in the column from the top, and show you the ranking for each individual process date.
However, if you start nesting, using hierarchies, renaming your columns, or having multiple aggregations and throwing (Column Names) across the top, you're going to have to pay a great deal of attention to your custom expression. You'll need to do some form of string replacement around the Axis.Column, or use Expressions instead of Short Names, get rid of nests, etc.
Any layer of complexity will require this sort of analysis, so if your end-users have access to modify the pivot table... honestly, I probably wouldn't give them this column.
Third Issue: I don't know if this is an issue, exactly, but you said "Average Counts" -- average per day? Per month? When averaging, you will need to decide if, for example, a month means the total number of days in the month or the number of days that particular group had data. However you decide to aggregate it, make sure you're doing it on the right level.
For the record, I liked the premise of this question; it's something I'd thought would be useful before, but never took the time to try to implement, since sorting a column or limiting a table to only show the top 10 values is much simpler.

List of items find almost duplicates

Within Excel I have a list of artists, songs, and editions.
This list contains over 15,000 records.
The problem is the list contains some "duplicate" records. I say "duplicate" because they aren't a complete match. Some might have a few typos, and I'd like to fix this up and remove those records.
So for example some records:
ABBA - Mamma Mia - Party
ABBA - Mama Mia! - Official
Each dash indicates a separate column (so 3 columns A, B, C are filled in)
How would I mark them as duplicates within Excel?
I've found out about the tool Fuzzy Lookup. Yet I'm working on a mac and since it's not available on mac I'm stuck.
Is there any regex magic or VBA script that can help me out?
It'd also be fine just to see how similar the rows are (say 80% similar).
One of the common methods for fuzzy text matching is the Levenshtein (distance) algorithm. Several nice implementations of this exist here:
https://stackoverflow.com/a/4243652/1278553
From there, you can use the function directly in your spreadsheet to find similarities between instances:
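For instance, a compact VBA take along the lines of those linked implementations (a sketch; adapt it to your own columns):

Public Function Levenshtein(ByVal a As String, ByVal b As String) As Long
    Dim i As Long, j As Long, cost As Long
    Dim d() As Long
    ReDim d(0 To Len(a), 0 To Len(b))

    ' Distance from the empty string is just the length
    For i = 0 To Len(a): d(i, 0) = i: Next i
    For j = 0 To Len(b): d(0, j) = j: Next j

    For i = 1 To Len(a)
        For j = 1 To Len(b)
            If Mid$(a, i, 1) = Mid$(b, j, 1) Then cost = 0 Else cost = 1
            ' Cheapest of deletion, insertion, substitution
            d(i, j) = Application.WorksheetFunction.Min( _
                d(i - 1, j) + 1, _
                d(i, j - 1) + 1, _
                d(i - 1, j - 1) + cost)
        Next j
    Next i

    Levenshtein = d(Len(a), Len(b))
End Function

You could then compare two cells directly with =Levenshtein(A2,A3), or turn it into a rough similarity with =1-Levenshtein(A2,A3)/MAX(LEN(A2),LEN(A3)) (cell references are assumptions).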
You didn't ask, but a database would be really nice here. The reason is you can do a cartesian join (one of the very few valid uses for this) and compare every single record against every other record. For example:
select
s1.group, s2.group, s1.song, s2.song,
levenshtein (s1.group, s2.group) as group_match,
levenshtein (s1.song, s2.song) as song_match
from
songs s1
cross join songs s2
order by
group_match, song_match
Yes, this would be a very costly query, depending on the number of records (in your example 225,000,000 rows), but it would bubble the most likely duplicates / matches to the top. Not only that, but you can incorporate "reasonable" joins to eliminate obvious mismatches, for example limiting it to cases where the group matches, nearly matches, or begins with the same letter, etc., or pre-filtering out groups where the Levenshtein distance is greater than x.
You could use an array formula to indicate the duplicates, and you could modify the below to show the row numbers. This checks the rows beneath the entry for any possible 80% dupes, where 80% is taken left to right, not as a total comparison. My data is in A1:A15000.
=IF(NOT(ISERROR(FIND(MID($A1,1,INT(LEN($A1)*0.8)),$A2:$A$15000))),1,0)
This version will also look back up the list, to indicate the ones found:
=SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A1)*0.8)),$A3:$A$15000,1)),0,1))+SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A2)*0.8)),$A$1:$A1,1)),0,1))
The first entry (i.e. row 1) needs only the first part of the formula, and the last row will need only the part after the +.
Try this worksheet function in your loop:
=COUNTIF(Range,"*yourtexttofind*")

VBA- picking one cell for each column that contain certain text in a Matrix with the max number of rows selected

I have a matrix of information that lets users input which tasks they are willing to do. Users have 3 choices:
A. I want to do this.
B. I do not mind doing this.
C. I do not want to do this.
After I collect the user data, I'd like to assign each task to a person based on their willingness (picking A over B), pairing up one task with one person. Is there any advice on how I can do this?
note:
column labels are user names and rows are tasks.
Obviously there are some tasks no one is willing to do, and it is OK to leave those blank (the number of tasks is expected to be greater than the number of users, so some tasks will be blank anyhow).
I do not need all possible solutions; just 1 solution will do.
You probably don't need VBA to solve this. Assign values for "A.", "B.", "C." and blank cells, then look for the max and use a vlookup to find the person; if that person has already been assigned, fall back to the next best.
I would create an index for overall willingness (sum of assigned values) per person to make the values more unique.
If you intend to make a macro, I would start with the lowest-willingness tasks and look for the highest value from people who are not yet assigned; if there is more than one such person, I would give the job to the person with the lowest overall willingness (since they'll be harder to assign to other tasks).
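For instance, a sketch under assumed references (user names in B1:F1, one task per row with the answers in B2:F2 and down, and helper scores for row 2 in H2:L2; the 3/2/1/0 weights are assumptions too): score each answer with something like

=IF(LEFT(B2,1)="A",3,IF(LEFT(B2,1)="B",2,IF(LEFT(B2,1)="C",1,0)))

and pick the most willing person for that task with INDEX/MATCH (standing in for the vlookup idea, since the lookup runs along a row):

=INDEX($B$1:$F$1,MATCH(MAX(H2:L2),H2:L2,0))

Ties can then be broken with the overall willingness index mentioned above.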
P.S. the smiley doesn't make up for your lack of willingness to work. :(

Search selection

For a C# program that I am writing, I need to compare similarities in two entities (can be documents, animals, or almost anything).
Based on certain properties, I calculate the similarities between the documents (or entities).
I put their similarities in a table as below
  |  X  |  Y  |  Z
A | 0.6 | 0.5 | 0.4
B | 0.6 | 0.4 | 0.2
C | 0.6 | 0.3 | 0.6
I want to find the best matching pairs (e.g. AX, BY, CZ) based on the highest similarity score. A high score indicates higher similarity.
My problem arises when there is a tie between similarity values. For example, AX and CZ both have 0.6. How do I decide which two pairs to select? Are there any procedures/theories for this kind of problem?
Thanks.
In general, tie-breaking methods are going to depend on the context of the problem. In some cases, you want to report all the tying results. In other situations, you can use an arbitrary means of selection such as which one is alphabetically first. Finally, you may choose to have a secondary characteristic which is only evaluated in the case of a tie in the primary characteristic.
Additionally, you can always report one or more and then alert the user that there was a tie to allow him or her to decide for him- or herself.
In this case, the similarities you should be looking for are:
- Value
- Row
- Column
Objects which have any of the above in common are "similar". You could assign a weighting to each property, so that objects which have the same value are more similar than objects which are in the same column. Also, objects which have the same value and are in the same column are more similar than objects with just the same value.
Depending on whether there are any natural ranges occurring in your data, you could also consider comparing ranges. For example two numbers in the range 0-0.5 might be somewhat similar.
