Numerical hash for comparing lexical similarity - search

Is there some form of hashing algorithm that produces similar numerical values for similar words? I imagine there would be a number of false positives, but it seems like something that could be useful for search pruning.
EDIT: Soundex is neat and may come in handy, but ideally, I want something that behaves like this: abs(f('horse') - f('hoarse')) < abs(f('horse') - f('goat'))

The Soundex algorithm generates strings of keys corresponding to the phonemes in the input word. http://www.archives.gov/research/census/soundex.html
If you only want to compare similarity between strings, try the Levenshtein distance. http://en.wikipedia.org/wiki/Levenshtein_distance
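For illustration, here is a minimal C++ sketch of the edit distance mentioned above, using the usual dynamic-programming formulation with two rolling rows. The function name levenshtein is only a placeholder. Note that this is a pairwise distance, not a hash: it gives an ordering like the one asked for only when one of the two words is held fixed.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Edit distance between a and b: the minimum number of single-character
// insertions, deletions and substitutions needed to turn a into b.
// Only the previous and current rows of the DP table are kept.
std::size_t levenshtein(const std::string& a, const std::string& b)
{
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) {
        prev[j] = j;  // turning an empty prefix of a into b[0..j) takes j insertions
    }
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // turning a[0..i) into an empty string takes i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t substitution =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, substitution});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

With the words from the question, levenshtein("horse", "hoarse") is 1 and levenshtein("horse", "goat") is 4.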

What you are talking about is called Locality-sensitive Hashing. It can be applied to different types of input (images, music, text, positions in space, whatever you need).
Unfortunately (and despite searching) I couldn't find any practical implementation of an LSH algorithm for strings.
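As a rough idea of what an LSH-style scheme for strings could look like, here is a small MinHash sketch over character bigrams. It is only one possible approach and an assumption on my part, not a reference implementation: the names kNumHashes, seededHash, minhash and estimatedSimilarity are made up for this example, and the seeded FNV-1a variant is an ad hoc choice. The result is a signature (an array of numbers) rather than a single number; the fraction of matching positions estimates the Jaccard similarity of the two bigram sets.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>

constexpr std::size_t kNumHashes = 64;  // signature length; an arbitrary choice
using Signature = std::array<std::uint64_t, kNumHashes>;

// 64-bit FNV-1a over one character bigram, with the offset basis
// perturbed by a seed so that each seed acts as a different hash function.
std::uint64_t seededHash(char first, char second, std::uint64_t seed)
{
    std::uint64_t h = 14695981039346656037ULL ^ (seed * 0x9E3779B97F4A7C15ULL);
    h ^= static_cast<unsigned char>(first);
    h *= 1099511628211ULL;
    h ^= static_cast<unsigned char>(second);
    h *= 1099511628211ULL;
    return h;
}

// For every hash function, keep the smallest hash seen over all bigrams.
// Words shorter than two characters have no bigrams and keep the initial
// maximum value in every position.
Signature minhash(const std::string& word)
{
    Signature sig;
    sig.fill(std::numeric_limits<std::uint64_t>::max());
    for (std::size_t i = 0; i + 1 < word.size(); ++i) {
        for (std::size_t k = 0; k < kNumHashes; ++k) {
            sig[k] = std::min(sig[k], seededHash(word[i], word[i + 1], k));
        }
    }
    return sig;
}

// Fraction of signature positions on which the two words agree.
double estimatedSimilarity(const Signature& a, const Signature& b)
{
    std::size_t matches = 0;
    for (std::size_t k = 0; k < kNumHashes; ++k) {
        if (a[k] == b[k]) {
            ++matches;
        }
    }
    return static_cast<double>(matches) / kNumHashes;
}
```

"horse" and "goat" share no bigrams, so their estimated similarity should be 0, while "horse" and "hoarse" share three of six distinct bigrams, so one would expect an estimate somewhere around 0.5. For search pruning, signatures are typically split into bands and words are bucketed by band, which is not shown here.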

You could always try Soundex and see if it fits your needs.

Check out the Soundex algorithm on Wikipedia. You haven't specified a language, but there are links to example implementations in multiple languages there. Obviously, this will give you a string hash that's the same for similar-sounding words, and you want an integer, but you could then apply the string->integer hashing method they use in Boost.Hash.
Edit: To clarify, here is an example C++ implementation...
#include <boost/foreach.hpp>
#include <boost/functional/hash.hpp>
#include <algorithm>
#include <cctype>
#include <string>
#include <iostream>
char SoundexChar(char ch)
{
switch (ch)
{
case 'B':
case 'F':
case 'P':
case 'V':
return '1';
case 'C':
case 'G':
case 'J':
case 'K':
case 'Q':
case 'S':
case 'X':
case 'Z':
return '2';
case 'D':
case 'T':
return '3';
case 'L':
return '4';
case 'M':
case 'N':
return '5';
case 'R':
return '6';
default:
return '.';
}
}
std::size_t SoundexHash(const std::string& word)
{
std::string soundex;
soundex.reserve(word.length());
BOOST_FOREACH(char ch, word)
{
if (std::isalpha(ch))
{
ch = std::toupper(ch);
if (soundex.length() == 0)
{
soundex.append(1, ch);
}
else
{
ch = SoundexChar(ch);
if (soundex.at(soundex.length() - 1) != ch)
{
soundex.append(1, ch);
}
}
}
}
soundex.erase(std::remove(soundex.begin(), soundex.end(), '.'), soundex.end());
if (soundex.length() < 4)
{
soundex.append(4 - soundex.length(), '0');
}
else if (soundex.length() > 4)
{
soundex = soundex.substr(0, 4);
}
return boost::hash_value(soundex);
}
int main()
{
std::cout << "Color = " << SoundexHash("Color") << std::endl;
std::cout << "Colour = " << SoundexHash("Colour") << std::endl;
std::cout << "Gray = " << SoundexHash("Gray") << std::endl;
std::cout << "Grey = " << SoundexHash("Grey") << std::endl;
return 0;
}

Related

I need to fix my code to count the syllables of multiple string inputs

As a new student of C++, I was originally given an assignment to write code that counts the number of syllables in a given string. Later it was changed so that it has to count multiple strings. Now keep in mind I'm not too far along in the class, and honestly I have my concerns about whether or not I'm actually learning what I need to pass this class. So I went back and started the frustrating process of changing code that already worked for a different task. I managed to produce the desired format of:
Word Syllable
Harry 2
Hairy 2
Hare 2
The 2
As you can tell it's not correct as it counts the syllables of only the first word and then applies it to the others. I tried changing it to a for loop but it didn't work so I went to a while loop and I got a somewhat better result:
Word Syllable
Harry 2
Word Syllable
Hare 1
So now it correctly counts the syllables but only of every other word instead of all and double prints the table header. Now even my cout command tells me it's ambiguous even though it still runs so I'm extra confused. I'm thinking I might have to change it into an array but at this point I'm completely stumped.
Here is my code so far:
#include <iostream>
#include <string>
#include <iomanip>
using namespace std;
int main()
{
cout << "Please enter four words: ";
string word;
while (cin >> word);
{
cin >> word;
bool last_vowel = false;
bool last_cons = false;
bool curr_vowel = false;
bool curr_cons = false;
int syllable_count = 0;
for (int i = 0; i < word.length(); i++)
{
string letter = word.substr(i, 1);
if (letter == "a" || letter == "e" ||
letter == "i" || letter == "o" ||
letter == "u" || letter == "y" ||
letter == "A" || letter == "E" ||
letter == "I" || letter == "O" ||
letter == "U" || letter == "Y")
{
curr_vowel = true;
curr_cons = false;
}
else
{
curr_vowel = false;
curr_cons = true;
}
// Increment the syllable count any time we
// transition from a vowel to a consonant
if (curr_cons && last_vowel)
{
syllable_count++;
}
last_vowel = curr_vowel;
last_cons = curr_cons;
}
// Check the last letter in word.
string last = word.substr(word.length() - 1, 1);
// Add one for an ending vowel that is not an "e"
if (last == "a" || last == "i" || last == "o" ||
last == "u" || last == "y" || last == "A" ||
last == "I" || last == "O" || last == "U" ||
last == "Y")
{
syllable_count++;
}
// There has to be at least one syllable
if (syllable_count == 0)
{
syllable_count = 1;
}
cout << left;
cout << setw(10) << "Word" << setw(20) << "Syllables" << endl;
cout << "__________________________" << endl;
cout << left;
cout << setw(19) << word << syllable_count << endl;
}
return 0;
}
Tony, the answer to your problem is very complicated. While researching the rules for syllables, I saw someone describe this as an unbelievably difficult problem in linguistics.
I was not able to find any hard-and-fast rule that tells you how many syllables there are in a word.
I settled on a few conditions for the program and checked its results against this online tool: https://syllablecounter.net/count. My guess is that they have a list of words to exclude from counting, or to include even if they do not fall under the basic conditions.
I wrote the code below to count the syllables in each word of a given phrase.
The program will keep asking for a phrase until you enter something other than y when asked to continue.
#include <iostream>
#include <string>
#include <sstream>
using namespace std;
int main() {
while (true) {
cout << "Enter the string you want to count syllable for: ";
string phrase, word;
string vowels = "aeiou";
getline(cin, phrase);
//cout << phrase << endl;
stringstream X(phrase); // Object of stringstream that references the phrase string
// iterate on the words of phrase
while (getline(X, word, ' ')) {
if (word.empty()) continue; // skip empty tokens produced by consecutive spaces
// logic to find syllable in the word.
// Traverse the string
int syllable_count = 0;
// if word[0] in vowels count as syllable
if(vowels.find(tolower(word[0])) != std::string::npos)
syllable_count +=1;
// if word[index] in vowels and word[index - 1] not in vowels count as syllable
for (int i = 1; i < word.size(); i++) {
if(vowels.find(tolower(word[i])) != std::string::npos && vowels.find(tolower(word[i-1])) == std::string::npos)
syllable_count +=1;
}
// if word ends with 'e' it does not count as a syllable
if(tolower(word[word.size()-1]) == 'e')
syllable_count -=1;
// if word ends with 'le', is longer than 2 chars, and the third-from-last char is not a vowel, it counts as a syllable
if((word.size() > 2) && (tolower(word[word.size()-1]) == 'e' && tolower(word[word.size()-2]) == 'l' ) && vowels.find(tolower(word[word.size()-3])) == std::string::npos)
syllable_count +=1;
// if word ends with 'y' treat it as a syllable
if(tolower(word[word.size()-1]) == 'y'){
syllable_count +=1;
// if word ends with 'y' preceded by a vowel, count both chars as one syllable
if((word.size() > 2) && vowels.find(tolower(word[word.size()-2])) != std::string::npos)
syllable_count -=1;
}
if(syllable_count == 0)
syllable_count += 1;
cout << word << " : " << syllable_count << endl; // print word and syllable count
}
cout << "Press y if you want to continue: ";
string stop;
getline(cin, stop);
if( stop != "y")
return 0;
}
return 0;
}
Sample result
Enter the string you want to count syllable for: troy
troy : 1
Press y if you want to continue: y
Enter the string you want to count syllable for: test
test : 1
Press y if you want to continue: y
Enter the string you want to count syllable for: hi how are you
hi : 1
how : 1
are : 1
you : 1
Press y if you want to continue: y
Enter the string you want to count syllable for: Harry hairy hare the
Harry : 2
hairy : 2
hare : 1
the : 1
Press y if you want to continue: n
You might need to add more conditions as you find special cases. This problem cannot be solved with if/else rules alone.
Hope this helps.

Do while loop won't break

I am making a Goldilocks game. If the user chooses the wrong answer, the program should loop back to the beginning. But whichever option I choose, it always loops back to the beginning, including the correct answer, which is 2. I am still new to C++. I do not understand why it loops back to the beginning when 2 is chosen, since the condition should then be true.
#include "stdafx.h"
#include <iostream>
#include <string>
using namespace std;
void FirstSet()
{
bool win = false;
string PName;
int choice;
int num1, num2, result;
do
{
system("CLS");
cout << " Please Enter Name \n";
cin >> PName;
cout << PName << " and the Three Bears\n\n ";
cout << PName << " Walks up and sees 3 Doors, 1 Large Door, 1 Medium Door and 1 Small Door. \n\n\n " << "Which Door do you want to Open?\n "
<< " 1 for The Large Door\n " << " 2 for the Medium Door\n " << " 3 for the small door\n ";
cin >> choice;
if (choice == '1')
{
cout << " The large door is too heavy it will not budge.\n "
<< " Please Try Again\n\n ";
system("pause");
}
else if (choice == '2')
{
win = true;
}
else if (choice == '3') {
cout << " The Door is too small you would get stuck.\n "
<< "Please Try Again\n\n";
}
} while (!win);
}
int main()
{
FirstSet();
system("pause");
return 0;
}
None of your comparisons evaluate to true because you read the input into an int variable and then compare it against the character literals '1', '2' and '3', whose ASCII values are 49, 50 and 51 respectively. If you modify your if lines to compare directly with integers, it should work:
if (choice == 1)
{
...
}
else if (choice == 2)
{
...
}
else if (choice == 3)
{
...
}
Although, for readability purposes and also to avoid such cases, I recommend using switch case statements in this case.
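A switch-based version of the door check could look roughly like the sketch below. The helper name tryDoor and the default branch are my own additions for illustration; the messages are taken from the question. The case labels are plain integers, so an input of 2 matches case 2, whereas the character literal '2' (value 50) would fall through to default.

```cpp
#include <iostream>

// Returns true when the chosen door is the winning one (the medium door);
// otherwise prints a hint and returns false so the caller can loop again.
bool tryDoor(int choice)
{
    switch (choice) {
    case 1:
        std::cout << " The large door is too heavy it will not budge.\n"
                  << " Please Try Again\n\n";
        return false;
    case 2:
        return true;
    case 3:
        std::cout << " The Door is too small you would get stuck.\n"
                  << " Please Try Again\n\n";
        return false;
    default:
        // Anything other than 1, 2 or 3 ends up here.
        std::cout << " Please enter 1, 2 or 3.\n\n";
        return false;
    }
}
```

Inside the do-while loop, the if/else chain would then reduce to win = tryDoor(choice);.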

Specific Characters in Palindrome Program Do not work

Here is my code. Some characters do not "work": when they are wrapped around the input, the program still says the string is a palindrome. The characters that don't work are:
single quotes, double quotes, commas, periods, forward slashes, back slashes, dashes, exclamation points, @ symbols, # symbols, $ symbols, % symbols, ^ symbols, & symbols, * symbols (asterisk), equals symbols, + symbol
int main()
{
int k = 1;
int i;
int length, halflength;
int yesno = 1;
char string [81];
char end[81] = "END";
while (k = 1)
{
cout << "Please enter a string of characters. " << endl;
cout << "Enter \"END\" in all caps to exit the program." << endl;
cin.getline(string, 81);
if (strcmp(string, "END") == 0)
{
return 0;
}
length = strlen(string);
halflength = length / 2;
for (i = 0; i < halflength; i++)
{
if (string[i] != string[length - i - 1]) // comparing
yesno = 0;
break;
}
if (yesno) {
cout << "You have successfully entered a palindrome." << endl;
}
else
{
cout << "You have not entered a palindrome." << endl;
return main();
}
}
}
I am unsure how to fix this, as a palindrome can not only be a sequence of letters, but a sequence of characters. If there is an easier way to compare the lines, then I would appreciate the help, as I have spent some time being frustrated at this.
Your program says "You have successfully entered a palindrome." for mlaylam!
The problem is that the break statement is not in the right place.
The body of the if should be enclosed in braces; otherwise (as in your code) the break runs unconditionally, so the for loop exits after comparing only the first and last characters, giving the wrong result.
if (string[i] != string[length - i - 1]){ // comparing
yesno = 0;
break;
}
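As an alternative way to structure the comparison, the check can be moved into a small helper that returns as soon as a mismatch is found, which avoids the flag and the break altogether. This is only a sketch; the name isPalindrome and the use of std::string instead of a char array are my own choices. It compares raw characters, so punctuation is treated like any other character.

```cpp
#include <cstddef>
#include <string>

// Returns true if text reads the same forwards and backwards,
// comparing every character exactly (punctuation and case included).
bool isPalindrome(const std::string& text)
{
    if (text.empty()) {
        return true;  // nothing to compare
    }
    std::size_t left = 0;
    std::size_t right = text.size() - 1;
    while (left < right) {
        if (text[left] != text[right]) {
            return false;  // first mismatch decides the answer
        }
        ++left;
        --right;
    }
    return true;
}
```

With this helper, isPalindrome("mlaylam!") is false and isPalindrome("!level!") is true.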

string definition problem

int main()
{
int score;
string scoreLetter;
char A, B, C, D, F;
cout<<"Enter the grade: ";
cin>> score;
void letterGrade (int score);
void gradeFinal (int score);
cout<<"The letter grade is "<< scoreLetter<<".";
}
void letterGrade (int score)
{
if (score >= 90) {
string scoreLetter = 'A'
} else if (score >= 80) {
scoreLetter = 'B'
} else if (score >= 70)
When compiling, I am getting an error for the line scoreLetter = 'A', etc. What is wrong with this? The error states 'scoreLetter' undefined. Do I need to define scoreLetter in the function, not in main?
Single quotes in C/C++ define a character, not a string. Use double quotes instead: string scoreLetter = "A". Also, if std::string is what you are trying to use, you need to include its header and bring the name into scope: #include <string> and using std::string.
You'll want to return whatever value your function determines scoreLetter should be, and in main have a line like scoreLetter = letterGrade(score);. You can't set local variables from another function, unless the caller passed them to you by reference, which isn't happening here (and, in most cases, shouldn't be abused like that).
Aside from that, it looks like you're mixing up declarations and invocations. void letterGrade (int score); doesn't actually call letterGrade; it just says that there is a letterGrade function that takes an int, which we'll call "score". (The score there is just a name that's part of the prototype; it has no connection to your score variable.) So you'll likely find that if you managed to get your code to compile, it'd do something quite different than you're expecting it to do.
To call a function, you'd do something like letterGrade(score);, or if you followed my first suggestion, scoreLetter = letterGrade(score);.
string letterGrade (int score);
// void gradeFinal (int score); // not used in this snippet
int main()
{
int score;
string scoreLetter;
char A, B, C, D, F;
cout<<"Enter the grade: ";
cin>> score;
scoreLetter = letterGrade(score);
cout<<"The letter grade is "<< scoreLetter<<".";
}
string letterGrade (int score)
{
string grade;
if (score >= 90) {
grade = "A";
} else if (score >= 80) {
grade = "B";
} else if (score >= 70)
...
return grade;
}
Your code is cut off there but I think from what is available you are not calling the functions properly.
You should also try to be more specific about what language you are talking about, i.e. C++ in this case, I believe.
A correct call to the functions would be:
letterGrade(score);
And yes, you need to define the string variable at global scope, so just move it out of main and above it.
string scoreLetter;
void letterGrade (int score);
void gradeFinal (int score);
int main()
{
int score;
char A, B, C, D, F;
cout<<"Enter the grade: ";
cin>> score;
letterGrade (score);
gradeFinal (score);
cout<<"The letter grade is "<< scoreLetter<<".";
}
void letterGrade (int score)
{
if (score >= 90) {
scoreLetter = "A";
} else if (score >= 80) {
scoreLetter = "B";
} else if (score >= 70)

Format string to title case

How do I format a string to title case?
Here is a simple static method to do this in C#:
public static string ToTitleCaseInvariant(string targetString)
{
return System.Threading.Thread.CurrentThread.CurrentCulture.TextInfo.ToTitleCase(targetString);
}
I would be wary of automatically upcasing all whitespace-preceded-words in scenarios where I would run the risk of attracting the fury of nitpickers.
I would at least consider implementing a dictionary for exception cases like articles and conjunctions. Behold:
"Beauty and the Beast"
And when it comes to proper nouns, the thing gets much uglier.
Here's a Perl solution http://daringfireball.net/2008/05/title_case
Here's a Ruby solution http://frankschmitt.org/projects/title-case
Here's a Ruby one-liner solution: http://snippets.dzone.com/posts/show/4702
'some string here'.gsub(/\b\w/){$&.upcase}
What the one-liner is doing is using a regular expression substitution of the first character of each word with the uppercase version of it.
"To capitalise it in, say, C - use the ASCII codes (http://www.asciitable.com/) to find the integer value of the char and subtract 32 from it."
This is a poor solution if you ever plan to accept characters beyond a-z and A-Z.
For instance, in extended ASCII (code page 437): 134 is å and 143 is Å.
Using arithmetic (134 - 32) gets you ASCII 102: f
Use library calls, don't assume you can use integer arithmetic on your characters to get back something useful. Unicode is tricky.
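To make the point concrete, here is a small C++ comparison of the two approaches. The function names toUpperNaive and toUpperSafe are made up for this example. std::toupper from <cctype> only changes characters that have an uppercase form in the current C locale, whereas subtracting 32 blindly maps every input to some other character. Note that std::toupper on its own does not handle multi-byte encodings such as UTF-8 either; it just fails more gracefully.

```cpp
#include <cctype>

// The arithmetic approach: only correct for 'a'..'z'.
char toUpperNaive(char ch)
{
    return static_cast<char>(ch - 32);
}

// The library approach: characters without an uppercase form are returned unchanged.
// The cast to unsigned char avoids undefined behaviour for negative char values.
char toUpperSafe(char ch)
{
    return static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
}
```

For example, toUpperNaive('{') yields '[' (123 - 32 = 91), while toUpperSafe('{') leaves the brace alone.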
In Silverlight there is no ToTitleCase in the TextInfo class.
Here's a simple regex based way.
Note: Silverlight doesn't have precompiled regexes, but for me this performance loss is not an issue.
public string TitleCase(string str)
{
return Regex.Replace(str, @"\w+", (m) =>
{
string tmp = m.Value;
return char.ToUpper(tmp[0]) + tmp.Substring(1, tmp.Length - 1).ToLower();
});
}
In Perl:
$string =~ s/(\w+)/\u\L$1/g;
That's even in the FAQ.
If the language you are using has a supported method/function then just use that (as in the C# ToTitleCase method)
If it does not, then you will want to do something like the following:
1. Read in the string
2. Take the first word
3. Capitalize the first letter of that word [1]
4. Go forward and find the next word
5. Go to 3 if not at the end of the string, otherwise exit
[1] To capitalize it in, say, C, use the ASCII codes to find the integer value of the char and subtract 32 from it.
There would need to be much more error checking in the code (ensuring valid letters etc.), and the "Capitalize" function will need to impose some sort of "title-case scheme" on the letters to check for words that do not need to be capitalised ('and', 'but' etc. Here is a good scheme)
In what language?
In PHP it is:
ucwords()
example:
$HelloWorld = ucwords('hello world');
In Java, you can use the following code.
public String titleCase(String str) {
char[] chars = str.toCharArray();
for (int i = 0; i < chars.length; i++) {
if (i == 0) {
chars[i] = Character.toUpperCase(chars[i]);
} else if ((i + 1) < chars.length && chars[i] == ' ') {
chars[i + 1] = Character.toUpperCase(chars[i + 1]);
}
}
return new String(chars);
}
Excel-like PROPER:
public static string ExcelProper(string s) {
bool upper_needed = true;
string result = "";
foreach (char c in s) {
bool is_letter = Char.IsLetter(c);
if (is_letter)
if (upper_needed)
result += Char.ToUpper(c);
else
result += Char.ToLower(c);
else
result += c;
upper_needed = !is_letter;
}
return result;
}
http://titlecase.com/ has an API
There is a built-in formula PROPER(n) in Excel.
Was quite pleased to see I didn't have to write it myself!
Here's an implementation in Python: https://launchpad.net/titlecase.py
And a port of this implementation that I've just done in C++: http://codepad.org/RrfcsZzO
Here is a simple example of how to do it :
public static string ToTitleCaseInvariant(string str)
{
return System.Threading.Thread.CurrentThread.CurrentCulture.TextInfo.ToTitleCase(str);
}
I think using CultureInfo is not always reliable. This is a simple and handy way to manipulate the string manually:
string sourceName = txtTextBox.Text.ToLower();
string destinationName = char.ToUpper(sourceName[0]).ToString();
for (int i = 0; i < (sourceName.Length - 1); i++) {
if (sourceName[i] == ' ') {
// the previous character is a space, so this one starts a new word
destinationName += char.ToUpper(sourceName[i + 1]);
}
else {
destinationName += sourceName[i + 1];
}
}
txtTextBox.Text = destinationName;
In C#
using System.Globalization;
using System.Threading;
protected void Page_Load(object sender, EventArgs e)
{
CultureInfo cultureInfo = Thread.CurrentThread.CurrentCulture;
TextInfo textInfo = cultureInfo.TextInfo;
Response.Write(textInfo.ToTitleCase("WelcometoHome<br />"));
Response.Write(textInfo.ToTitleCase("Welcome to Home"));
Response.Write(textInfo.ToTitleCase("Welcome#to$home<br/>").Replace("#","").Replace("$", ""));
}
In C# you can simply use
CultureInfo.InvariantCulture.TextInfo.ToTitleCase(str.ToLowerInvariant())
Invariant
Works with uppercase strings
Without using a ready-made function, a super-simple low-level algorithm to convert a string to title case:
convert first character to uppercase.
for each character in string,
if the previous character is whitespace,
convert character to uppercase.
This assumes the "convert character to uppercase" will do that correctly regardless of whether or not the character is case-sensitive (e.g., '+').
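The low-level algorithm above could be sketched in C++ as follows. The function name toTitleCase is hypothetical, and the sketch assumes single-byte characters in the default C locale; it makes no attempt to skip small words such as "and" or "the".

```cpp
#include <cctype>
#include <string>

// Uppercases the first character and every character that follows whitespace.
// Characters without an uppercase form (such as '+') are left unchanged,
// because std::toupper returns them as they are.
std::string toTitleCase(std::string text)
{
    bool previousWasSpace = true;  // treat the start of the string like whitespace
    for (char& ch : text) {
        const unsigned char uch = static_cast<unsigned char>(ch);
        if (previousWasSpace) {
            ch = static_cast<char>(std::toupper(uch));
        }
        previousWasSpace = std::isspace(uch) != 0;
    }
    return text;
}
```

For instance, toTitleCase("hello world") gives "Hello World", and toTitleCase("a+b +c") gives "A+b +c".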
Here you have a C++ version. It has a set of words that should not be capitalized, such as articles and prepositions. However, I would not recommend automating this process if you have to deal with important texts.
#include <iostream>
#include <string>
#include <vector>
#include <cctype>
#include <set>
using namespace std;
typedef vector<pair<string, int> > subDivision;
set<string> nonUpperCaseAble;
subDivision split(string & cadena, string delim = " "){
subDivision retorno;
int pos, inic = 0;
while((pos = cadena.find_first_of(delim, inic)) != cadena.npos){
if(pos-inic > 0){
retorno.push_back(make_pair(cadena.substr(inic, pos-inic), inic));
}
inic = pos+1;
}
if(inic != cadena.length()){
retorno.push_back(make_pair(cadena.substr(inic, cadena.length() - inic), inic));
}
return retorno;
}
string firstUpper (string & pal){
pal[0] = toupper(pal[0]);
return pal;
}
int main()
{
nonUpperCaseAble.insert("the");
nonUpperCaseAble.insert("of");
nonUpperCaseAble.insert("in");
// ...
string linea, resultado;
cout << "Type the line you want to convert: " << endl;
getline(cin, linea);
subDivision trozos = split(linea);
for(int i = 0; i < trozos.size(); i++){
if(trozos[i].second == 0)
{
resultado += firstUpper(trozos[i].first);
}
else if (linea[trozos[i].second-1] == ' ')
{
if(nonUpperCaseAble.find(trozos[i].first) == nonUpperCaseAble.end())
{
resultado += " " + firstUpper(trozos[i].first);
}else{
resultado += " " + trozos[i].first;
}
}
else
{
resultado += trozos[i].first;
}
}
cout << resultado << endl;
getchar();
return 0;
}
With Perl you could do this:
my $tc_string = join ' ', map { ucfirst($_) } split /\s+/, $string;
