"Fuzzy Lookup" add in results - excel

I am using Excel 2010 and the Microsoft "Fuzzy Lookup" add-in to compare one column from each of two worksheets. The first worksheet has around 48,000 rows (x 3 columns); the second has around 23,000 rows (x 5 columns). Fuzzy Lookup compares one column from each and returns a similarity score between the two.
The fuzzy lookup appears to run without a problem, and the results - in most cases - appear to be correct. For example:
W2-NK22/16 in one worksheet is reported as having a 0.97 similarity to W2NK2216.
But not in all cases. Some pairs that I expected to show some degree of similarity instead have 0.000 returned by the add-in. For example:
761689700000
should have some degree of similarity to:
761689700000EN4239
but the Fuzzy Lookup add-in returns 0.000 for it. Both fields are formatted as text. Neither has leading or trailing spaces, and the first 12 characters are identical.
I have uninstalled and reinstalled the add-in, and have used the default settings. The only other Fuzzy Lookup setting I have changed is Configure --> Global --> UseApproximateIndexing. I have tried both False and True, and neither had any impact.
I have hundreds of examples like the one above that show 0.000 similarity but, upon inspection, appear to be very similar. Rows before and after them show various degrees of similarity.
Any thoughts as to why this doesn't appear to work correctly, or suggestions for a better way to do this approximate match, would be appreciated.

Trying to add content even though this case is 2 years old. Hopefully someone else can use it.
For Transformations, Tokenization, etc - look in the same folder where Fuzzy Lookup is installed. There is an example file there called Portfolio.xlsx and a corresponding Readme.docx file. Those are very helpful. Frankly the documentation on the Fuzzy Lookup add-in is terrible (but it is free). The Readme file talks about an entitlement called "EditTransformationProvider" that might help this kind of problem.
I've implemented Fuzzy Lookup on a couple of processes at my work and we have saved hundreds of man-hours when working in Excel. It's no joke.
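As a quick cross-check outside Excel, the two pairs of part numbers from the question can be scored with Python's standard-library difflib. This is only a sketch: it uses a different technique (matching character runs) from whatever the add-in does internally, so the scores are not expected to equal the add-in's. The helper name similarity is made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 score based on the matching character runs in a and b."""
    return SequenceMatcher(None, a, b).ratio()


# The two example pairs quoted in the question.
pairs = [
    ("W2-NK22/16", "W2NK2216"),
    ("761689700000", "761689700000EN4239"),
]

for left, right in pairs:
    print(left, right, round(similarity(left, right), 3))
```

A character-based score like this gives the second pair a clearly non-zero value, which at least confirms that the strings themselves are similar and that the 0.000 comes from how the add-in compares them.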

Related

Excel interpolation with results in situ

Extrapolation in Excel is easy: have a list of numbers (and optionally their paired "X-values"), and it can easily generate further entries in the list with the GROWTH() function.
GROWTH() works for interpolation too: you just need to tell it the intermediate X-values that you want it to calculate for. My problem with it is the appearance of the data in the spreadsheet. Here's an example:
Say I have some inputs, and through some process get some outputs. Only, there were gaps in the experiment so no outputs were generated for some values:
Out of curiosity, I copied the data to the right, and used Excel's "Extend with Growth Trend": I highlighted the first two entries (only), then right-click-dragged-down the little square over the next four cells (overriding the final value there) and chose "Growth Trend" in the context menu. To remind myself that the values were Excel-generated, I gave them a grey background:
Hmm. The generated values (unsurprisingly) aren't a good extrapolation, since they don't factor in the later value. It's out by over 40%! Also note that this Extend feature of Excel is an ease-of-input mechanism, not a calculation tool in its own right - Excel enters the data as raw numbers (to multiple decimal places).
So I formalised the Extend column by using the GROWTH() function - again only factoring in the first two values, but also using their paired X-values and the desired interpolation entry as parameters:
D4: =GROWTH(D$2:D$3,$A$2:$A$3,$A4)
D5: =GROWTH(D$2:D$3,$A$2:$A$3,$A5)
D6: =GROWTH(D$2:D$3,$A$2:$A$3,$A6)
Thankfully, the results mimic those of the previous column (Microsoft use the same mechanism for both features!) I didn't overwrite the last entry, since after all it has the value that I actually want! The fact that the calculated values are the same as before is the problem I'm trying to fix, and that this question is about.
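To make the size of that error concrete, here is a small Python sketch (not an Excel formula) that extends an exponential curve through the first two points only and compares it with the final known value. It assumes the data points quoted later in this thread (360 -> 7.16, 370 -> 28.9, 410 -> 5380); the function name growth_two_points is hypothetical.

```python
def growth_two_points(x0, y0, x1, y1, x):
    """Evaluate the exponential curve through (x0, y0) and (x1, y1) at x."""
    ratio = y1 / y0  # growth factor per (x1 - x0) step
    return y0 * ratio ** ((x - x0) / (x1 - x0))


# Extrapolate from the first two known points to x = 410.
predicted = growth_two_points(360, 7.16, 370, 28.9, 410)
actual = 5380.0
relative_error = (predicted - actual) / actual

print(round(predicted, 1), f"{relative_error:.1%}")
```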
To improve the calculated values, I need to incorporate the last value - but at the same time I want the "natural" sequence of input values to be maintained. In other words, I want the interpolated values to be placed in situ. That implies that the arguments to the GROWTH() function need to be discontiguous ranges, which Excel does by using the (Range,Range,...) syntax. I tried it, and got #REF! errors. I then tried using a named discontiguous range: same result.
After a bit of Googling (and StackOverflowing!) I found references to using INDIRECT() - a particularly problematic 'solution', since it evaluates strings that would need to be manually maintained. Nevertheless:
E4: =GROWTH(INDIRECT({"E2:E3","E7"}),INDIRECT({"A2:A3","A7"}),A4)
E5: =GROWTH(INDIRECT({"E2:E3","E7"}),INDIRECT({"A2:A3","A7"}),A5)
E6: =GROWTH(INDIRECT({"E2:E3","E7"}),INDIRECT({"A2:A3","A7"}),A6)
…and after all that it didn't work anyway! The values remained the same as the previous version, that didn't incorporate the last value. Maybe the last value doesn't make for better interpolation results? So, as an experiment, I ignored the "in situ" requirement and generated an "ex situ" version, with the known values followed by the desired values, allowing me to use simple ranges. Success! But to highlight that the data is in the wrong order, I asked Excel to create an X-Y plot of the data too:
B13: =GROWTH(B$10:B$12,$A$10:$A$12,$A13)
B14: =GROWTH(B$10:B$12,$A$10:$A$12,$A14)
B15: =GROWTH(B$10:B$12,$A$10:$A$12,$A15)
Of course, the results are exponential not linear, so setting the Y-axis to logarithmic generates a very readable result - and it effectively masks the back-and-forth of the data. But deep down, we both know that the data is wrong - just look at the table!
Maybe, just maybe, if I used Excel's "Sort Data" feature it would break up the range for me, and show me how I should have written the formulae? Sadly, although it looks like it worked, I get a "Circular reference" error for B12 - the range wasn't modified to make it discontiguous, and now B12's result is dependent on the original range which includes itself! I coloured it below to indicate that this isn't a viable solution:
So, my "final" solution is to maintain the previous "ex situ" version, and simply have an "in situ" column as well that does a VLOOKUP() on the ExSitu (named) table - and I needed to tell it to do an exact match with the FALSE parameter, since the list isn't sorted:
F4: =VLOOKUP($A4,ExSitu,2,FALSE)
F5: =VLOOKUP($A5,ExSitu,2,FALSE)
F6: =VLOOKUP($A6,ExSitu,2,FALSE)
Note that I labelled the column with an asterisk since it's a cheat: the values are only in situ by copying from another table.
Phew! After all that, my question:
Is there a way to directly interpolate the "in situ" values, without having to have an "ex situ" lookup table to generate the results? The above example was deliberately straightforward: you can easily imagine a longer list with more gaps to be filled in.
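For comparison, this is what an in situ fill looks like outside Excel. The Python sketch below assumes GROWTH() behaves like a least-squares fit of ln(y) against x, fits that curve to the known points only, and then fills the gaps while leaving the list in its natural order. The function name fill_gaps_growth and the use of None for a gap are illustrative choices, and the data points are the ones quoted in the answer below.

```python
import math


def fill_gaps_growth(xs, ys):
    """Return ys with None entries replaced by an exponential least-squares fit.

    Fits ln(y) = a + b*x to the known points, then evaluates exp(a + b*x)
    only where y is missing. Known values are returned untouched.
    """
    known = [(x, math.log(y)) for x, y in zip(xs, ys) if y is not None]
    n = len(known)
    mean_x = sum(x for x, _ in known) / n
    mean_ly = sum(ly for _, ly in known) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in known)
    sxy = sum((x - mean_x) * (ly - mean_ly) for x, ly in known)
    b = sxy / sxx
    a = mean_ly - b * mean_x
    return [y if y is not None else math.exp(a + b * x) for x, y in zip(xs, ys)]


xs = [360, 370, 380, 390, 400, 410]
ys = [7.16, 28.9, None, None, None, 5380.0]
filled = fill_gaps_growth(xs, ys)
print([round(v, 2) for v in filled])
```

The known rows and the gaps live in the same list, so no separate "ex situ" table is needed; the fit simply skips the missing entries.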
Since you have good data sense, I'll share how I explored this case. I'm more of a visual person; I don't see things that clearly in tables. Here is what I did with your data points:
Input Raw
360 7.16
370 28.9
380
390
400
410 5,380.00
Highlight everything and press my favorite button: F11. I choose the line chart type. Then, with the plus button on the top left of the chart, I add Trendline > More Options. From there I choose 'polynomial' and 'exponential', and tick 'display equation on chart'. As you can see in the links, both fits seem OK. Just take the equation and plug in the other values as needed.
Three things I've noticed:
1. The polynomial and exponential fits are close enough to what I need, but neither passes exactly through the (410, 5380.00) point.
2. Having the formula makes it easier to judge whether the trendline proposed by Excel is a close fit to my need. As you play around you can see how far off the linear and logarithmic trendlines can be.
3. The trendline equation doesn't use 360, 370, 410... as the x values; it assumes x is 0, 1, 2, 3... (try testing it with the equation of the trendline Excel proposes).
IMHO, use Excel trendlines with care. My next-best fitting tool: the WolframAlpha logarithmic fit.
For the original question :
Is there a way to directly interpolate the "in situ" values, without having to have an "ex situ" lookup table to generate the results?
I think my simple answer is: indirectly, yes. Directly? Not sure.
Hope this heals/helps in some ways.. ( :

Replacing numeric values in Excel sheet with text values from other sheet

I am using Surveymonkey for a questionnaire. Most of my data has a regular scale from 0-6, plus an "Other" option that people can use if they choose not to answer the item. However, when I download the data, Surveymonkey automatically assigns a value of 0 to that no-answer category, and it appears this can't be changed.
As a result, I don't know whether a zero in my numeric dataset actually means zero or means the participant chose not to answer the question. I can only figure that out by looking at another file that contains the labels of participants' answers (all answers are given by their corresponding labels, so this data file is missing all non-labeled answers...).
This leads me to my problem: I have two Excel files of the same size. I need a way to find certain values in one dataset (a text value, scattered randomly over the dataset) and replace the corresponding numeric values in the other dataset (at the same position) with those values.
I thought it would be possible to find all the values and copy-paste them in the same pattern, but I cannot seem to find a way to do that. I feel like I am missing an obvious solution, but after searching for quite a while I could not find an answer to my specific question.
I have never worked with macros or more advanced Excel programming before, but I have a bit of general programming knowledge. I hope I explained this well; I would be very thankful for any suggestions or scripts that could help me out here!
Thank you!
Alex
I don't know how your Excel file is organised, but if it's like the legacy Condensed format, all you should need to do is to select the column corresponding to a given question (if that's what you have), and search and replace all 0 (match entire cell) with the text you want.
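If the two files really are the same shape, the position-by-position replacement described in the question can be sketched in a few lines of Python. This is a minimal sketch, assuming both sheets have already been read into lists of rows (for example from CSV exports); the function name mark_unanswered, the "Other" label and the "NA" marker are placeholders rather than anything Surveymonkey defines.

```python
def mark_unanswered(numeric_rows, label_rows, unanswered_label="Other", marker="NA"):
    """Return a copy of numeric_rows with marker wherever label_rows has unanswered_label.

    Both inputs are lists of rows with identical shape; cells are compared
    position by position.
    """
    result = []
    for num_row, lab_row in zip(numeric_rows, label_rows):
        result.append([
            marker if label == unanswered_label else value
            for value, label in zip(num_row, lab_row)
        ])
    return result


# Hypothetical two-respondent, three-question example.
numeric = [[0, 3, 6], [0, 0, 2]]
labels = [["Other", "Sometimes", "Always"], ["Never", "Other", "Rarely"]]
cleaned = mark_unanswered(numeric, labels)
print(cleaned)
```

Genuine zeros (where the label is a real scale answer such as "Never") are left alone; only the cells whose label matches the no-answer text are replaced.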

How to optimize COUNTIFS with very large data

I would like to create a report that looks like the picture below.
My data has around 500,000 cells (and it will continue to grow).
Right now I'm using Excel's COUNTIFS function, but it takes a very long time to calculate (and I cannot turn off automatic calculation).
The main value is stored as a date, and the dates span about 3 years, so I have to enter a lot of formulas to cover the whole range of values.
result
The picture below shows the data source. The top one cannot be changed, while the bottom one is a table I created myself (and can change). I use WEEKNUM to convert each date to a week number.
data
Is there a better formula, or any other way to make this file faster? All kinds of suggestions are welcome!
I was thinking about using Pivot Table, but I don't know how to make pivot table from this kind of datasource.
PS. VBA is the last option.
You can download example file here: https://www.mediafire.com/?t21s8ngn9mlme2d
I will post this answer with the disclaimer that it is entirely dependent on the size of the data set. Switching automatic calculation off and on is the best approach, but your question rules that out, so keep reading.
Your question made me curious, so I gave it a try and timed it. I set up two columns of over 100,000 random numbers, each chosen from 1-1000, and then tried to do a COUNTIF on the two columns if they were equal. I made a macro that turns off automatic calculation, inserts the start time, calculates, and then inserts the finish time. I highlighted the time difference in yellow.
First I tried your way, two criteria, COUNTIFS:
Then I tried to combine (concatenate) the two columns to see if I could make it faster by having only one COUNTIF criterion and data set. It doesn't help. See the result below:
Finally, realizing what was going on, I decided to make the criteria match only the FIRST character of the number being looked for. I was essentially reducing the number of characters to check per cell. This had a positive result. See below:
Therefore my suggestion is to limit the length of the values you are comparing in any way possible. You are mostly looking at dates, so you might have to get creative, but this seems to be the best way possible without going to manual calculation.
I have worked with Excel sheets of a similar size. Especially if you are using the data on a regular basis, I would heartily recommend switching to a proper database: SQL-based, Access, or whatever fits your purpose. It does wonders for the speed, and you won't run into the size limits of Excel. :-)
You can import the data you have now fairly easily.
I am happy as a clam with my postgresql db.
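Whichever tool ends up holding the data, the gain comes from counting everything in one pass instead of rescanning the full range once per report cell, which is what each COUNTIFS formula does. Here is a minimal Python sketch of that idea, assuming each record is a (category, date) pair; the real layout is only visible in the missing screenshots, so the field names are guesses. Note that it groups by ISO week, which does not always match the numbers Excel's WEEKNUM returns with its default setting.

```python
from collections import Counter
from datetime import date


def count_by_category_and_week(records):
    """Count records per (category, ISO year, ISO week) in a single pass."""
    counts = Counter()
    for category, day in records:
        iso_year, iso_week, _ = day.isocalendar()
        counts[(category, iso_year, iso_week)] += 1
    return counts


# Hypothetical sample rows.
records = [
    ("A", date(2015, 1, 5)),
    ("A", date(2015, 1, 6)),
    ("B", date(2015, 1, 6)),
    ("A", date(2015, 1, 12)),
]
counts = count_by_category_and_week(records)
print(counts[("A", 2015, 2)])
```

A pivot table or a SQL GROUP BY does the same kind of single-pass grouping.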

Improve Vlookup on large file

I have a very large file that I reduced as much as possible, down to 3 columns and 80k rows.
I need to perform a VLOOKUP to bring in values where column 1 or 2 matches values in some other spreadsheets.
The thing is, Excel doesn't seem to cope with such large searches, and it stops responding. The computer has 4GB of RAM and a quad-core CPU, and not much else running at the same time.
As far as I understand, since I'm not looking for exact matches, I should not use MATCH/INDEX.
The only thing I thought might help, though I'm not sure about it, is dividing the file into 2-4 parts and asking Excel to run several parallel searches instead of one big one. Could this work?
What else should I try?
Thanks!!!
Sort your data and use True as the 4th VLOOKUP argument. This makes VLOOKUP use binary search rather than linear search and is lightning fast.
If you need to handle missing data you will need to use the double VLOOKUP trick, see
http://fastexcel.wordpress.com/2012/03/29/vlookup-tricks-why-2-vlookups-are-better-than-1-vlookup/
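The idea behind the sorted lookup can be sketched in Python with the standard bisect module. This is an illustration of the technique rather than of Excel's internals: a binary search finds the last key that is less than or equal to the target (as an approximate-match lookup on sorted data does), and a second comparison checks whether that key is actually the one asked for, which is my reading of the "double VLOOKUP" trick in the linked post. The function name sorted_lookup is made up.

```python
from bisect import bisect_right


def sorted_lookup(keys, values, target, default=None):
    """Look up target in ascending keys by binary search; default if absent."""
    i = bisect_right(keys, target) - 1   # index of the last key <= target
    if i >= 0 and keys[i] == target:     # second check: is it an exact hit?
        return values[i]
    return default


keys = [10, 20, 30, 40]          # must be sorted ascending
values = ["a", "b", "c", "d"]

print(sorted_lookup(keys, values, 30))
print(sorted_lookup(keys, values, 35, default="missing"))
```

Binary search needs roughly 17 comparisons for 80k sorted rows, against tens of thousands for a linear scan, which is where the speed-up comes from.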

How do I parse every string while comparing two excel spreadsheets?

Good evening,
I'm attempting to compare two excel spreadsheets by using the IF and MATCH functions as follows:
=IF(ISERROR(MATCH(fromADP!$C2,fromSMS!$A$2:$A$4792,0)),"No match found",fromADP!$C2)
I have two worksheets (fromADP and fromSMS). I'm trying to compare them to find out which records in the fromADP worksheet also appear in the fromSMS worksheet. The MATCH function allows only three options for the match_type argument. I'm using 0, although I'm not sure I understand exactly how the other two options work. I tried them, though, without the desired results.
When I use match_type 0 I only get one match - but this is an exact match (as I would expect). My problem is, some of the records do in fact exist in both worksheets but there are minor differences (for example, "Tony's" vs. "Tonny's" or "Jimmy's Trucking, LLC" vs. "Jimmy's Trucking").
So I'm wondering, is there another way to do this or could there be - perhaps - a vbscript that would parse each string in my lookup_value? This way, I can find those records where there might be slight differences.
I'm afraid I may simply have to pull out my ruler and pencil and start combing through the spreadsheets, line-by-line. Any help would be appreciated.
Hi all,
Thanks to the solution offered here, I was able to use the Fuzzy Lookup add-in for Excel to accomplish this task. Thus, my question is answered and my issue resolved.
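For anyone who would rather script the comparison than install an add-in, Python's standard-library difflib offers a rough equivalent. The sketch below is not how Fuzzy Lookup works; it just picks the closest candidate above a similarity cutoff. The function name best_match, the 0.6 cutoff and the sample list are all illustrative, with the names taken from the examples in the question.

```python
from difflib import get_close_matches


def best_match(name, candidates, cutoff=0.6):
    """Return the candidate most similar to name, or None if none reach cutoff."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical contents of the fromSMS column.
from_sms = ["Tony's", "Jimmy's Trucking, LLC"]

print(best_match("Tonny's", from_sms))
print(best_match("Jimmy's Trucking", from_sms))
print(best_match("Acme Widgets", from_sms))
```

Raising the cutoff makes the matching stricter; anything it returns should still be checked by eye, since near-identical names can belong to different records.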
