Is it possible to do a Levenshtein distance in Excel without having to resort to Macros?

Let me explain.
I have to do some fuzzy matching for a company, so at the moment I use a Levenshtein distance calculator and then calculate the percentage of similarity between the two terms. If the terms are more than 80% similar, Fuzzymatch returns "TRUE".
My problem is that I'm on an internship and leaving soon. The people who will continue this work do not know how to use Excel with macros, and want me to implement what I did as best I can.
So my question is: however inefficient the function may be, is there ANY way to build a standard function in Excel that calculates what I did before, without resorting to macros?
Thanks.

If you came across this by googling something like
levenshtein distance google sheets
I threw this together, using the code comment from milot-midia on this gist (https://gist.github.com/andrei-m/982927 - code under MIT license).
From Sheets, in the header menu, go to Tools -> Script Editor
Name the project
The name of the function (not the project) is what you will use to call it from a cell
Paste the following code
function Levenshtein(a, b) {
  if(a.length == 0) return b.length;
  if(b.length == 0) return a.length;
  // swap to save some memory O(min(a,b)) instead of O(a)
  if(a.length > b.length) {
    var tmp = a;
    a = b;
    b = tmp;
  }
  var row = [];
  // init the row
  for(var i = 0; i <= a.length; i++){
    row[i] = i;
  }
  // fill in the rest
  for(var i = 1; i <= b.length; i++){
    var prev = i;
    for(var j = 1; j <= a.length; j++){
      var val;
      if(b.charAt(i-1) == a.charAt(j-1)){
        val = row[j-1]; // match
      } else {
        val = Math.min(row[j-1] + 1, // substitution
                       prev + 1,     // insertion
                       row[j] + 1);  // deletion
      }
      row[j - 1] = prev;
      prev = val;
    }
    row[a.length] = prev;
  }
  return row[a.length];
}
You should be able to run it from a spreadsheet with
=Levenshtein(cell_1,cell_2)
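The question also turns the distance into a similarity percentage and returns TRUE above 80%. Here is a minimal sketch of that wrapper, assuming similarity is defined as 1 minus the distance divided by the length of the longer string (the question does not state its exact formula). The names `similarity` and `fuzzyMatch` and the compact `levenshtein` helper are illustrative, not taken from the original post:

```javascript
// Compact two-row Levenshtein distance (illustrative helper).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = cur;
  }
  return prev[b.length];
}

// Assumed definition: 1 - distance / length of the longer string.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1; // two empty strings are identical
  return 1 - levenshtein(a, b) / longest;
}

// Assumed threshold from the question: more than 80% similar.
function fuzzyMatch(a, b, threshold = 0.8) {
  return similarity(a, b) > threshold;
}
```

Under that assumed definition, `fuzzyMatch("building", "bilding")` is true, because one edit over eight characters gives 87.5%.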

While it can't be done in a single formula for any reasonably-sized strings, you can use formulas alone to compute the Levenshtein Distance between strings using a worksheet.
Here is an example that can handle strings up to 15 characters; it could easily be expanded for more:
https://docs.google.com/spreadsheet/ccc?key=0AkZy12yffb5YdFNybkNJaE5hTG9VYkNpdW5ZOWowSFE&usp=sharing
This isn't practical for anything other than ad-hoc comparisons, but it does do a decent job of showing how the algorithm works.
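To make the worksheet idea concrete, here is a sketch of the full dynamic-programming table, where each entry depends only on the cell above, the cell to the left, and the diagonal cell. It assumes the sheet uses the standard table layout, which the linked spreadsheet may or may not follow exactly; the function name `levenshteinTable` is illustrative:

```javascript
// Builds the full (a.length + 1) x (b.length + 1) table,
// one entry per worksheet cell.
function levenshteinTable(a, b) {
  const table = [];
  for (let i = 0; i <= a.length; i++) {
    table.push(new Array(b.length + 1).fill(0));
    table[i][0] = i; // first column: delete i characters
  }
  for (let j = 0; j <= b.length; j++) {
    table[0][j] = j; // first row: insert j characters
  }
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      table[i][j] = Math.min(
        table[i - 1][j] + 1,       // cell above
        table[i][j - 1] + 1,       // cell to the left
        table[i - 1][j - 1] + cost // diagonal cell
      );
    }
  }
  return table;
}
```

The distance is the bottom-right entry, `table[a.length][b.length]`. In a sheet, each inner cell would hold the same `MIN` of its three neighbours.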

Looking at the previous answers on calculating Levenshtein distance, I think it would be impossible to create it as a formula.
Take a look at the code here

Actually, I think I just found a workaround. I was adding it in the wrong part of the code...
Adding this line
} else if(b.charAt(i-1)==a.charAt(j) && b.charAt(i)==a.charAt(j-1)){
  val = row[j-1]-0.33; //transposition
so it now reads
if(b.charAt(i-1) == a.charAt(j-1)){
  val = row[j-1]; // match
} else if(b.charAt(i-1)==a.charAt(j) && b.charAt(i)==a.charAt(j-1)){
  val = row[j-1]-0.33; //transposition
} else {
  val = Math.min(row[j-1] + 1, // substitution
                 prev + 1,     // insertion
                 row[j] + 1);  // deletion
}
Seems to fix the problem. Now 'biulding' is 92% accurate and 'bilding' is 88%. (whereas with the original formula 'biulding' was only 75%... despite being closer to the correct spelling of building)
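An alternative to adjusting the cost by a fraction is the optimal string alignment variant of Damerau-Levenshtein distance, which counts a swap of two adjacent characters as a single edit. The sketch below is that swapped-in technique, not the code from this answer. The name `osaDistance` is illustrative, and it uses a full table instead of the single rolling row to keep the transposition lookup readable:

```javascript
// Optimal string alignment distance: Levenshtein plus adjacent
// transpositions, each counted as one edit.
function osaDistance(a, b) {
  const d = [];
  for (let i = 0; i <= a.length; i++) {
    d.push(new Array(b.length + 1).fill(0));
    d[i][0] = i;
  }
  for (let j = 0; j <= b.length; j++) {
    d[0][j] = j;
  }
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,       // deletion
        d[i][j - 1] + 1,       // insertion
        d[i - 1][j - 1] + cost // match or substitution
      );
      // Two adjacent characters swapped: look two cells back diagonally.
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
      }
    }
  }
  return d[a.length][b.length];
}
```

With this distance, 'biulding' and 'bilding' are both one edit away from 'building', and all values stay whole numbers.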


How do I generate a string that contains a keyword?

I'm currently making a program with many functions that utilise Math.random(). I'm trying to generate a string that contains a given keyword (in this case, lathe). I want the program to log a string that has "lathe" in it (in any mix of upper and lower case), but everything I've tried makes the program hit its call stack size limit. I understand exactly why that happens; I want the program to generate a string with the word without hitting that limit.
What I have tried:
function generateStringWithKeyword(randNum: number) {
const chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/";
let result = "";
for(let i = 0; i < randNum; i++) {
result += chars[Math.floor(Math.random() * chars.length)];
if(result.includes("lathe")) {
continue;
} else {
generateStringWithKeyword(randNum);
}
}
console.log(result);
}
This is what I have now. After some brief research on Stack Overflow I learned that it might be better to use an if/else block with a continue, rather than using
if(!result.includes("lathe")) return generateStringWithKeyword(randNum);
But either way I hit the call stack size limit.
A "correct" version of your algorithm, written as an iterative function instead of as a recursive one so as not to exceed stack depth, would look something like this:
function generateStringWithKeyword(randNum: number) {
const chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/";
let result = "";
let attemptCnt = 0;
while (!result.toLowerCase().includes("lathe")) {
attemptCnt++;
result = "";
for (let i = 0; i < randNum; i++) {
result += chars[Math.floor(Math.random() * chars.length)];
}
if (attemptCnt > 1e6) {
console.log("I GIVE UP");
return;
}
}
console.log(result);
return result;
}
I don't like when my browser hangs because of a script that won't finish, so I put a maximum attempt count in there. A million chances seems reasonable. When you try it out, this happens:
generateStringWithKeyword(10); // I GIVE UP
Which makes sense; let's perform a rough back-of-the-envelope probability calculation to see how long we might expect this to take. The chance that "lathe" will appear in some case at position 1 of the word is (2/64)×(2/64)×(2/64)×(2/64)×(2/64) ("L" or "l" appears first, followed by "A" or "a", etc.), which is approximately 3×10^-8. For a word of length 10, "lathe" can appear starting at positions 1, 2, 3, 4, 5, or 6. While this isn't exactly correct, let's think of this as multiplying your chances of getting the word somewhere by 6, so the actual chance of getting a valid result is somewhere around 1.8×10^-7. So we can expect that you'd need approximately 1 ÷ (1.8×10^-7) ≈ 5.6 million attempts to succeed.
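The same estimate can be written out as a few lines of code. This is only a sketch of the rough reasoning above (it treats the start positions as if they simply add up, which is an approximation), and the function name `estimateAttempts` is made up for illustration:

```javascript
// Rough estimate of how many random strings are needed before one
// contains the keyword in any mix of upper and lower case.
function estimateAttempts(length, keyword, alphabetSize) {
  const perChar = 2 / alphabetSize;              // upper- or lowercase letter
  const atOnePosition = Math.pow(perChar, keyword.length);
  const positions = length - keyword.length + 1; // possible start positions
  const perAttempt = positions * atOnePosition;  // approximate chance per string
  return { atOnePosition, perAttempt, expectedAttempts: 1 / perAttempt };
}

const estimate = estimateAttempts(10, "lathe", 64);
console.log(estimate);
```

For length 10 and a 64-character alphabet this gives about 3×10^-8 for a fixed position, about 1.8×10^-7 per attempt, and roughly 5.6 million expected attempts.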
Oh, darn, I only gave it a million. Let's up that to 10 million and try again:
generateStringWithKeyword(10); // "lATHELEYSc"
Great! Although, it does sometimes still give up. And really, an algorithm which needs millions of tries before it succeeds is very, very inefficient. You might want to read about bogosort, a sorting algorithm which works by randomly shuffling things and checking to see if they are sorted, and it keeps trying until it works. It's used for educational purposes to highlight how such techniques don't really perform well enough to be practical. Nobody would ever want to use such an algorithm for real.
So how would you do this "the right" way? Well, my suggestion here is to just build your result correctly the first time. If you have 10 characters and 5 of them need to be "lathe" in some case, then you will need 5 truly random characters. So randomly decide how many of those letters should be before "lathe". If you pick 2, for example, then put 2 random characters, plus "lathe" in a random case, plus 3 more random characters.
It could be something like this, where I mostly use your same style of for-loops and += string concatenation:
function generateStringWithKeyword(randNum: number) {
const keyword = "lathe";
if (randNum < keyword.length) throw new Error(
"This is not possible; \"" + keyword + "\" doesn't fit in " + randNum + " characters"
);
const actuallyRandNum = randNum - keyword.length;
const chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/";
let result = "";
const kwInsertionPoint = Math.floor(Math.random() * (actuallyRandNum + 1));
for (let i = 0; i < kwInsertionPoint; i++) {
result += chars[Math.floor(Math.random() * chars.length)];
}
for (let i = 0; i < keyword.length; i++) {
result += Math.random() < 0.5 ? keyword[i].toLowerCase() : keyword[i].toUpperCase();
}
for (let i = kwInsertionPoint; i < actuallyRandNum; i++) {
result += chars[Math.floor(Math.random() * chars.length)];
}
return result;
}
If you run this, you will see that it is very efficient, and never gives up:
console.log(Array.from({ length: 4 }, () => generateStringWithKeyword(5)).join(" "));
// "lathE LaThe lATHe LatHe"
console.log(Array.from({ length: 4 }, () => generateStringWithKeyword(7)).join(" "));
// "p6lAtHe laThE01 nlaTheK lATHeRJ"
console.log(Array.from({ length: 4 }, () => generateStringWithKeyword(10)).join(" "));
// "giMqzLaTHe 5klAthegBo oVdLatHe0q twNlATheCr"

Optimal algorithm for this string decompression

I have been working on an exercise from Google's dev tech guide. It is called Compression and Decompression; the problem is described in the linked Challenge Description.
Here is my code for the solution:
public static String decompressV2 (String string, int start, int times) {
String result = "";
for (int i = 0; i < times; i++) {
inner:
{
for (int j = start; j < string.length(); j++) {
if (isNumeric(string.substring(j, j + 1))) {
String num = string.substring(j, j + 1);
int times2 = Integer.parseInt(num);
String temp = decompressV2(string, j + 2, times2);
result = result + temp;
int next_j = find_next(string, j + 2);
j = next_j;
continue;
}
if (string.substring(j, j + 1).equals("]")) { // If it is a closing bracket
break inner;
}
result = result + string.substring(j,j+1);
}
}
}
return result;
}
public static int find_next(String string, int start) {
int count = 0;
for (int i = start; i < string.length(); i++) {
if (string.substring(i, i+1).equals("[")) {
count= count + 1;
}
if (string.substring(i, i +1).equals("]") && count> 0) {
count = count- 1;
continue;
}
if (string.substring(i, i +1).equals("]") && count== 0) {
return i;
}
}
return -111111;
}
I will explain a little about the inner workings of my approach. It is a basic solution that uses simple recursion and loops.
So, let's start from the beginning with a simple decompression:
DevTech.decompressV2("2[3[a]b]", 0, 1);
As you can see, the 0 indicates that it has to iterate over the string starting at index 0, and the 1 indicates that the string has to be evaluated only once: 1[ 2[3[a]b] ]
The core idea is that every time you encounter a number, you call the algorithm again (recursively) and then continue from where the string inside its brackets ends; that is what the find_next function is for.
When it finds a closing bracket, the inner loop breaks; that is how I chose to signal the stop.
I think that is the main idea behind the algorithm; if you read the code closely you'll get the full picture.
So here are some of my concerns about the way I've written the solution:
I could not find a cleaner way to tell the algorithm where to go next when it finds a number, so I kind of hardcoded it with the find_next function. Is there a cleaner way to do this inside the decompress function?
About performance: it wastes a lot of time doing the same work again when there is a number bigger than 1 at the beginning of a bracket.
I am relatively new to programming, so maybe this code also needs improvement not in the idea but in the way it's written. I would be very grateful for suggestions.
This is the approach I figured out, but I am sure there are others. I could not think of any, so it would be great if you could share your ideas.
The description lists some things you should be aware of when developing a solution: handling non-repeated strings, handling repetitions inside, not doing the same job twice, and not copying too much. Are these covered by my approach?
The last point is about test cases. I know that confidence is very important when developing solutions, and the best way to gain confidence in an algorithm is test cases. I tried a few and they all worked as expected. But what techniques do you recommend for developing test cases? Is there any software for this?
So that is all. I am new to the community, so I am open to suggestions on how to improve the quality of the question. Cheers!
Your solution involves a lot of string copying that really slows it down. Instead of returning strings that you concatenate, you should pass a StringBuilder into every call and append substrings onto that.
That means you can use your return value to indicate the position to continue scanning from.
You're also parsing repeated parts of the source string more than once.
My solution looks like this:
public static String decompress(String src)
{
StringBuilder dest = new StringBuilder();
_decomp2(dest, src, 0);
return dest.toString();
}
private static int _decomp2(StringBuilder dest, String src, int pos)
{
int num=0;
while(pos < src.length()) {
char c = src.charAt(pos++);
if (c == ']') {
break;
}
if (c>='0' && c<='9') {
num = num*10 + (c-'0');
} else if (c=='[') {
int startlen = dest.length();
pos = _decomp2(dest, src, pos);
if (num<1) {
// 0 repetitions -- delete it
dest.setLength(startlen);
} else {
// copy output num-1 times
int copyEnd = startlen + (num-1) * (dest.length()-startlen);
for (int i=startlen; i<copyEnd; ++i) {
dest.append(dest.charAt(i));
}
}
num=0;
} else {
// regular char
dest.append(c);
num=0;
}
}
return pos;
}
I would try to return a tuple that also contains the next index where decompression should continue from. Then we can have a recursion that concatenates the current part with the rest of the block in the current recursion depth.
Here's JavaScript code. It takes some thought to encapsulate the order of operations that reflects the rules.
function f(s, i=0){
if (i == s.length)
return ['', i];
// We might start with a multiplier
let m = '';
while (!isNaN(s[i]))
m = m + s[i++];
// If we have a multiplier, we'll
// also have a nested expression
if (s[i] == '['){
let result = '';
const [word, nextIdx] = f(s, i + 1);
for (let j=0; j<Number(m); j++)
result = result + word;
const [rest, end] = f(s, nextIdx);
return [result + rest, end]
}
// Otherwise, we may have a word,
let word = '';
while (isNaN(s[i]) && s[i] != ']' && i < s.length)
word = word + s[i++];
// followed by either the end of an expression
// or another multiplier
const [rest, end] = s[i] == ']' ? ['', i + 1] : f(s, i);
return [word + rest, end];
}
var strs = [
'2[3[a]b]',
'10[a]',
'3[abc]4[ab]c',
'2[2[a]g2[r]]'
];
for (const s of strs){
console.log(s);
console.log(JSON.stringify(f(s)));
console.log('');
}

AS3 "Advanced" string manipulation

I'm making an AIR dictionary and I have a(nother) problem. The main app is ready to go and works perfectly, but when I tested it I noticed that it could be better. A bit of context: the language I'm translating from (ancient Egyptian) does not use punctuation, so a phrase canlooklikethis. Add to that the sheer complexity of the glyph system (6000+ glyphs).
Right now my app works like this:
the user chooses the glyphs composing his/her word.
the app transforms those glyphs into alphanumerical values (A1 - D36 - X1A, etc).
the code compares the resulting code (say: A5AD36) to a list of XML values.
if the word is found (A5AD36 = priestess of Bast), the user gets the translation. If not, s/he gets all the possible words corresponding to the two glyphs (A5A & D36).
If the user knows the string is a single word, no problem. But if s/he enters several words, s/he'll get a few more choices than hoped (example: query = A1A5AD36 returns A1 - A5A - D36 - A5AD36).
What I would like to do is this:
query = A1A5AD36 //word/phrase to be translated;
varArray = [A1, A5A, D36] //variables containing the value of the glyphs.
Corresponding possible words from the xml : A1, A5A, D36, A5AD36.
Possible phrases: A1 A5A D36 / A1 A5AD36 / A1A5A D36 / A1A5AD36.
Possible phrases with only legal words: A1 A5A D36 / A1 A5AD36.
I'm not sure I'm really clear, but to put things simply, I'd like to get all the possible phrases containing only legal words and filter out the other ones.
(Example in English: TOBREAKFAST. Legal = to break fast / to breakfast. Illegal = tobreak fast.)
I've managed to get all the possible words, but not the rest. Right now, when I run my app, I have an array containing A1 - A5A - D36 - A5AD36. But I'm stuck going forward.
Does anyone have an idea? Thank you :)
function fnSearch(e: Event): void {
var val: int = sp.length; //sp is an array filled with variables containing the code for each used glyph.
for (var i: int = 0; i < val; i++) { //repeat for every glyph use.
var X: String = ""; //variable created to compare with xml dictionary
for (var i2: int = 0; i2 < val; i2++) { // if it's the first time, use the first glyph-code, else the one after last used.
if (X == "") {
X = sp[i];
} else {
X = X + sp[i2 + i];
}
xmlresult = myXML.mot.cd; //xmlresult = alphanumerical codes corresponding to words from XMLList already imported
trad = myXML.mot.td; //same with traductions.
for (var i3: int = 0; i3 < xmlresult.length(); i3++) { //check if element X is in dictionary
var codeElement: XML = xmlresult[i3]; //variable to compare with X
var tradElement: XML = trad[i3]; //variable corresponding to codeElement
if (X == codeElement.toString()) { //if codeElement[i3] is legal, add it to array of legal words.
checkArray.push(codeElement); //checkArray is an array filled with legal words.
}
}
}
}
var iT2: int = 500 //iT2 set to unreachable value for next lines.
for (var iT: int = 0; iT < checkArray.length; iT++) { //check if the word searched by user is in the results.
if (checkArray[iT] == query) {
iT2 = iT
}
}
if (iT2 != 500) { //if complete query is found, put it on top of the array so it appears on top of the results.
var oldFirst: String = checkArray[0];
checkArray[0] = checkArray[iT2];
checkArray[iT2] = oldFirst;
}
results.visible = true; //make result list visible
loadingResults.visible = false; //loading screen
fnPossibleResults(null); //update result list.
}
I end up with an array of variables containing the glyph-codes (sp) and another with all the possible legal words (checkArray). What I don't know how to do is combine those two to make legal phrases.
If there were only three glyphs, I could probably find a way, but the user can enter up to 60 glyphs.
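One possible way to combine the two arrays is a recursive split with memoization: starting at each glyph position, grow a candidate word glyph by glyph and, whenever it is a legal word, continue from the next position. The sketch below is in JavaScript rather than AS3, and `glyphs` and `legalWords` are hypothetical stand-ins for `sp` and the codes loaded from the XML; it is not tested against the real dictionary:

```javascript
// Enumerate every way to split the glyph sequence into legal words.
// `glyphs` is an array of glyph codes, `legalWords` a Set of word codes.
function legalPhrases(glyphs, legalWords) {
  const memo = new Map(); // start index -> list of phrases (arrays of words)

  function from(start) {
    if (start === glyphs.length) return [[]]; // nothing left: one empty phrase
    if (memo.has(start)) return memo.get(start);
    const phrases = [];
    let word = "";
    for (let end = start; end < glyphs.length; end++) {
      word += glyphs[end]; // grow the candidate word one glyph at a time
      if (legalWords.has(word)) {
        for (const rest of from(end + 1)) {
          phrases.push([word, ...rest]);
        }
      }
    }
    memo.set(start, phrases);
    return phrases;
  }

  return from(0).map((words) => words.join(" "));
}

const phrases = legalPhrases(
  ["A1", "A5A", "D36"],
  new Set(["A1", "A5A", "D36", "A5AD36"])
);
console.log(phrases); // ["A1 A5A D36", "A1 A5AD36"]
```

The memo avoids re-splitting the same tail of the query, but the number of legal phrases itself can still grow quickly with 60 glyphs, so the result list may need a cap.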

Iterative deepening search selected bad moves

I'm writing a Nine Men's Morris game and so far I have a Negascout search that works just fine. However, I would like to add iterative deepening, so I came up with this code:
public Move GetBestMove(IBoard board, int depth)
{
//Search limits (ms)
this.maxTime = 9000;
//Set initial window
int alpha = -INFINITY, beta = INFINITY;
int val = 0;
//The move that will be returned
Move bestMove = null;
//Get list of moves for the current board
List<Move> moves = board.getMoves();
//Get the time search has started
long startTime = System.nanoTime();
//Iterate through the depths
for (curDepth = 1; ; )
{
maxDepth = curDepth;
//Reset alpha
alpha = -INFINITY;
//Reset the best score position
int bestPos = -1;
//Loop through all the moves
for (int i = 0, n = moves.size(); i < n; i++)
{
//Make the move
board.make(moves.get(i), true);
//Search deeper
val = negascout(board, curDepth, alpha, beta, startTime);
//Undo the move
board.undo(moves.get(i));
//Keep best move
if (val > alpha)
{
bestMove = moves.get(i);
bestPos = i;
}
//Score missed aspiration window
if (val <= alpha || val >= beta)
{
alpha = -INFINITY;
beta = INFINITY;
//Go to next iteration
continue;
}
//Set new aspiration window
alpha = val - ASPIRATION_SIZE;
if (alpha < -INFINITY)
alpha = -INFINITY;
beta = val + ASPIRATION_SIZE;
if (beta > INFINITY)
beta = INFINITY;
}
//Move the best move to the top of the list
if (bestPos != -1)
{
moves.remove(bestPos);
moves.add(0, bestMove);
}
//Time check
double curTime = (System.nanoTime() - startTime) / 1e6;
if (curTime >= maxTime ||
val == board.getMaxScoreValue() ||
val == -board.getMaxScoreValue())
break;
//Increment current depth
curDepth++;
}
//Return the move
return bestMove;
}
I also use an aspiration window. However, the search returns the worst possible move!! I think that the problem is with re-/setting the search window. Should the search window be moved to the outer loop?
Since you're using negascout, your initial call should look like
val = -negascout(board, curDepth - 1, -beta, -alpha, startTime);
Your root call is the exact opposite compared to internal nodes, so that explains why it's returning the worst possible move.
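To illustrate why the sign matters, here is a sketch of a root loop that uses the same negated call as the internal nodes. It uses plain negamax with alpha-beta pruning rather than negascout, written in JavaScript, and a toy tree of nested arrays stands in for the board, so `negamax`, `bestRootMove`, and the tree format are all assumptions made for the example:

```javascript
// Toy game tree: a leaf is a number (score for the side to move at
// that leaf), an inner node is an array of child nodes.
function negamax(node, alpha, beta) {
  if (typeof node === "number") return node;
  let best = -Infinity;
  for (const child of node) {
    // The child's value is from the opponent's point of view,
    // so negate it and pass the negated, swapped window.
    const score = -negamax(child, -beta, -alpha);
    if (score > best) best = score;
    if (best > alpha) alpha = best;
    if (alpha >= beta) break; // cutoff
  }
  return best;
}

// The root loop negates the result exactly like an internal node does.
function bestRootMove(root) {
  let alpha = -Infinity;
  let bestIndex = -1;
  for (let i = 0; i < root.length; i++) {
    const score = -negamax(root[i], -Infinity, -alpha);
    if (score > alpha) {
      alpha = score;
      bestIndex = i;
    }
  }
  return { bestIndex, score: alpha };
}

console.log(bestRootMove([[3, 5], [2, 9], [4, 6]])); // { bestIndex: 2, score: 4 }
```

Dropping the minus sign in `bestRootMove` would rank the moves by the opponent's score, which is how a search ends up preferring the worst move.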
The iterative deepening strategy:
for (depth = 1;; depth++) {
val = AlphaBeta(depth, -INFINITY, INFINITY); // or negascout
if (TimedOut())
break;
}
looks different from the one you implemented in GetBestMove. The inner loop (iterating through the possible moves) should be part of negascout. Furthermore, it seems that you only store the move ordering at the first depth level (1-ply), but to make the iterative deepening search really fast, it needs a move ordering at every depth searched so far. Iterative deepening not only has the advantage of taking time into account (finish after x seconds), but also of generating a good move ordering. The alpha-beta or negascout algorithm benefits from a good move ordering (try this move first because it was the best in a previous search). A common way to implement move ordering is the transposition table.
The documents The Main Transposition Table and Iterative Deepening by Bruce Moreland were very helpful to me and I hope that the links can help you too!

Search an integer in a row-sorted two dim array, is there any better approach?

I recently came across this problem:
you have to find an integer in a sorted two-dimensional array. But the array is sorted within rows, not within columns. I have solved the problem but still think there may be a better approach, so I have come here to discuss it with all of you. Your suggestions and improvements will help me grow as a coder. Here is the code:
int searchInteger = Int32.Parse(Console.ReadLine());
int cnt = 0;
for (int i = 0; i < x; i++)
{
if (intarry[i, 0] <= searchInteger && intarry[i,y-1] >= searchInteger)
{
if (intarry[i, 0] == searchInteger || intarry[i, y - 1] == searchInteger)
Console.WriteLine("string present {0} times" , ++cnt);
else
{
int[] array = new int[y];
int y1 = 0;
for (int k = 0; k < y; k++)
array[k] = intarry[i, y1++];
bool result;
if (result = binarySearch(array, searchInteger) == true)
{
Console.WriteLine("string present inside {0} times", ++ cnt);
Console.ReadLine();
}
}
}
}
Here searchInteger is the integer we have to find in the array, and binarySearch is the method that returns a boolean indicating whether the value is present in a single-dimension array (that single row).
Please help: is this optimal, or are there better solutions?
Thanks
Provided you have declared the array intarry, x and y as follows:
int[,] intarry =
{
{0,7,2},
{3,4,5},
{6,7,8}
};
var y = intarry.GetUpperBound(0)+1;
var x = intarry.GetUpperBound(1)+1;
// intarry.Dump();
You can keep it as simple as:
int searchInteger = Int32.Parse(Console.ReadLine());
var cnt=0;
for(var r=0; r<y; r++)
{
for(var c=0; c<x; c++)
{
if (intarry[r, c].Equals(searchInteger))
{
cnt++;
Console.WriteLine(
"string present at position [{0},{1}]" , r, c);
} // if
} // for
} // for
Console.WriteLine("string present {0} times" , cnt);
This example assumes that you have no information about whether the array is sorted (which means you have to go through every element and can't use binary search). Based on this example, you can improve performance if you know more about how the data in the array is structured:
if the rows are sorted ascending, you can replace the inner for loop with a binary search
if the entire array is sorted ascending and the data does not repeat, e.g.
int[,] intarry = {{0,1,2}, {3,4,5}, {6,7,8}};
then you can exit the loop as soon as the item is found. The easiest way to do this is to create a function and add a return statement to the inner for loop.
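The first refinement (a binary search per sorted row) could look roughly like this. It is a JavaScript sketch rather than C#, it assumes every row is sorted ascending and holds integers, and the names `lowerBound` and `countInRowSorted` are made up for the example:

```javascript
// Index of the first element in a sorted row that is >= target.
function lowerBound(row, target) {
  let lo = 0;
  let hi = row.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (row[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Count occurrences of target in a matrix whose rows are each sorted.
function countInRowSorted(matrix, target) {
  let count = 0;
  for (const row of matrix) {
    // Skip rows whose value range cannot contain the target.
    if (row.length === 0 || target < row[0] || target > row[row.length - 1]) continue;
    const first = lowerBound(row, target);
    const end = lowerBound(row, target + 1); // valid because values are integers
    count += end - first;
  }
  return count;
}
```

Each row costs O(log n) instead of O(n), and duplicates inside a row are counted without copying the row into a temporary array.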
