Determining Soundex conversion

When converting the name 'Lukasieicz' to Soundex (LETTER,DIGIT,DIGIT,DIGIT,DIGIT), I come up with L2222.
However, my lecture slides say the answer is supposed to be L2220.
Please explain why my answer is incorrect, or whether the lecture answer is just a typo.
My steps:
Lukasieicz
remove and keep L
ukasieicz
Remove contiguous duplicate characters
ukasieicz
remove A,E,H,I,O,U,W,Y
KSCZ
convert up to first four remaining letters to soundex (as described in lecture directions)
2222
append beginning letter
L2222

If this is American Soundex as defined by the National Archives, you're both wrong. American Soundex contains one letter and three numbers, so you can't have either L2222 or L2220. It's L222.
But let's say they added another number for some reason.
The basic substitution gives L2222. But you're supposed to collapse adjacent letters with the same numbers (step 3 below) and then pad with zeros if necessary (step 4).
3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
4. If you have too few letters in your word that you can't assign [four] numbers, append with zeros until there are [four] numbers. If you have more than [4] letters, just retain the first [4] numbers.
Lukasieicz # the original word
L_2_2___22 # replace with numbers, leave the gaps in
L_2_2___2 # apply step 3 and squeeze adjacent numbers
L2220 # apply step 4 and pad to four numbers
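The steps above can be sketched in Python. This is a minimal sketch under my own assumptions, not a reference implementation: the function name soundex and the digits parameter (3 for the standard code, 4 for the lecture's variant) are made up for illustration, and any character without a Soundex code is treated like a vowel.

```python
# Letter groups and their Soundex digits.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name, digits=3):
    """Return the first letter plus `digits` code digits, zero padded."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    out = []
    prev = CODES.get(first, "")  # the squeeze rule also covers the first letter
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not break a run of the same code
        code = CODES.get(c, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so the same code may repeat after it
    # Keep at most `digits` digits and pad with zeros (step 4).
    return first + "".join(out)[:digits].ljust(digits, "0")


print(soundex("Lukasieicz", 4))  # L2220
print(soundex("Lukacz"))         # L220
```

The final C and Z in Lukasieicz are adjacent and share the code 2, so only one digit is emitted for them; that is the squeeze the question's steps skipped.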
We can check how conventional (i.e. three-number) Soundex implementations behave with the shorter Lukacz, which becomes L_2_22. Following rules 3 and 4, it should be L220.
The National Archives recommends an online Soundex calculator which produces L220. So does PostgreSQL and Text::Soundex in both its original flavor and NARA implementations.
$ perl -wle 'use Text::Soundex; print soundex("Lukacz"); print soundex_nara("Lukacz")'
L220
L220
MySQL, predictably, is doing its own thing and returns L200. Its documentation explains:
This function implements the original Soundex algorithm, not the more popular enhanced version (also described by D. Knuth). The difference is that original version discards vowels first and duplicates second, whereas the enhanced version discards duplicates first and vowels second.
In conclusion, you forgot the squeeze step.

Check if string contains consecutive repeated substring

I got an interview problem which asks to determine whether or not a given string contains substring repeated right after it. For example:
ATAYTAYUV contains TAY after TAY
AABCD contains A after A
ABCAB contains two AB, but they are not consecutive, so the answer is negative
My idea was to look at the first letter, find its second occurrence then check letter by letter if the letters after the first occurrence match the letters after the second occurrence. If they all do, the answer is positive. If not, once I get a mismatch, I can repeat the process but starting with the last letter I checked, since I would not be able to get a repeated sequence up to that point.
I am not sure if the approach is correct or if it is the most efficient.
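Assume that you are looking for a repeating pattern of length 3. Write the string underneath itself, shifted right by three positions (and trimmed to the overlap). A run of 3 consecutive columns in which the two rows agree marks a substring of length 3 that is immediately repeated. In the example below, the rows agree on the three columns holding TAY.
ATAYTAYUV
   ATAYTA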
Repeat this for all lengths up to N/2.
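A minimal Python sketch of this shift-and-compare idea follows. The function name has_consecutive_repeat is my own, and the sketch makes no claim to being the most efficient approach: it checks every length up to N/2 and so takes O(N^2) time in the worst case.

```python
def has_consecutive_repeat(s):
    """True if some non-empty substring of s is immediately followed by itself."""
    n = len(s)
    for k in range(1, n // 2 + 1):   # candidate pattern length
        run = 0                      # current run of matching columns
        for i in range(k, n):
            # Compare s with a copy of itself shifted right by k positions.
            run = run + 1 if s[i] == s[i - k] else 0
            if run == k:             # k matches in a row: a block repeats
                return True
    return False


print(has_consecutive_repeat("ATAYTAYUV"))  # True  (TAY after TAY)
print(has_consecutive_repeat("AABCD"))      # True  (A after A)
print(has_consecutive_repeat("ABCAB"))      # False (the two ABs are not adjacent)
```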

How to make an excel (365) function that recognizes different words in the same cell and changes them individually

What im working with
I have a list of product names, but unfortunately they are written in all uppercase. I want only the first letter of each word to be uppercase and the rest lowercase, but I also want all words with 3 or fewer characters to stay uppercase.
I'm trying IF functions, but nothing is really working.
I use the German version of Excel, but I would be happy for any idea on how to do it. I have been trying different functions for hours without success.
=IF(LENGTH(C6)<=3,UPPER(C6),UPPER(LEFT(C6,1))&LOWER(RIGHT(C6,LENGTH(C6)-1)))
But it gives a #NAME? error; Excel does not recognize the first and the last bracket.
This is hard! Let me explain:
I do believe there are German words in the mix that are below 4 characters in length that you should exclude. My German isn't great, but there are probably a great many words below 4 characters;
There seems to be substrings that are 3+ characters in length but should probably stay uppercase, e.g. '550E/ER';
There seem to be quite a bunch of characters that could be used as delimiters to split the input into 'words'. It's hard to catch any of them without a full list;
Possible other reasons;
With the above in mind, I think it's safe to say that we can only try to accomplish what you want as best as we can. Therefore I'd suggest:
To split on multiple characters;
Exclude certain words from being uppercase when length < 4;
Include certain words to be uppercase when length > 3 and digits are present;
Assume 1st character could be made uppercase in any input;
For example:
Formula in B1:
=MAP(A1:A5,LAMBDA(v,LET(x,TEXTSPLIT(v,{"-","/"," ","."},,1),y,TEXTSPLIT(v,x,,1),z,TEXTJOIN(y,,MAP(x,LAMBDA(w,IF(SUM(--(w={"zu","ein","für","aus"})),LOWER(w),IF((LEN(w)<4)+SUM(IFERROR(FIND(SEQUENCE(10,,0),w),)),UPPER(w),LOWER(w)))))),UPPER(LEFT(z))&MID(z,2,LEN(v)))))
You can see how difficult it is to capture each and every possibility;
The minute you exclude a few words, another will pop up (the 'x' between numbers, for example, which should stay uppercase or lowercase depending on the context it is found in);
The second you include words containing digits, you notice that some should be excluded ('00SICHERUNGS....');
If the 1st character is a digit, the above solution would not uppercase the 1st alphabetic character;
Maybe some characters shouldn't be used as delimiters based on context? Think about hyphenated words;
Possible other reasons.
Point is, this is not just hard, it's extremely hard if not impossible to do on the type of data you are currently working with! Even if one is proficient at writing a regular expression (chuck in all the tokens, quantifiers and methods that are not available to Excel if you like), I doubt all edge cases could be covered.
Because you are dealing with any number of words in a cell you'll need to get crafty with this one. Thankfully there is TEXTSPLIT() and TEXTJOIN() that can make short work of splitting the text into words, where we can then test the length, change the capitalization, and then join them back together all in one formula:
=TEXTJOIN(" ", TRUE, IF(LEN(TEXTSPLIT(C6," "))<=3,UPPER(TEXTSPLIT(C6," ")),PROPER(TEXTSPLIT(C6," "))))
I also used PROPER(), which capitalizes the first letter of each word and lowercases the rest.
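For anyone who wants to check the split, test, recase, and join logic outside Excel, here is a rough Python sketch of the same rule. The function name fix_case and the max_upper_len parameter are hypothetical, and str.capitalize() is only an approximation of PROPER(): it uppercases the first character of the word and lowercases everything after it.

```python
def fix_case(name, max_upper_len=3):
    """Keep short words uppercase; capitalize only the first letter of the rest."""
    words = name.split(" ")
    fixed = [
        w.upper() if len(w) <= max_upper_len else w.capitalize()
        for w in words
        if w  # skip empty pieces caused by repeated spaces
    ]
    return " ".join(fixed)


print(fix_case("ROTE LED LAMPE 12V"))  # Rote LED Lampe 12V
```

Like the formula, this only splits on spaces, so it does not handle the extra delimiters and exception words discussed in the other answer.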

How do I concatenate combinations of letters and numbers in APL?

I'm in Dyalog 17 and would like to generate unique names to be used with its graphics object library. So, for example, I have the letter 'l' and want to take the number 1, convert it to a character and then concatenate the two together to form 'l1'. This is such trivial stuff in other languages but I can't find the documentation explaining how to do this in APL. Thanks for your help!
To concatenate the letter 'l' to the number 1 to form the characters 'l1' you do this:
'l',⍕1
The system function ⎕FMT can be of use here. For example:
'P<I>ZI7' ⎕FMT ⍳10
I000000
I000001
I000002
I000003
I000004
I000005
I000006
I000007
I000008
I000009
The format string says to format the numbers as integers, in a width of 7, zero filling, with a positive left decoration of the letter 'I'.
I'm on APL2 on the mainframe, so my answer might not be exactly what you're after, but here's how I would do it:
∊⍕¨'L',1
So first catenate the letter and the numeric digits. Then FORMAT EACH to produce a vector of character scalars. Finally, ENLIST to produce a simple vector.
This is a slight generalization of SteveH's reply. It is more general in the sense that it handles input strings (rather than scalars) and works equally well regardless of whether the digit or the letter comes first.

Finding all words in a paragraph whose first three letters are the same?

What is the best way to solve this problem? Is there an algorithm for it?
"In a paragraph we have to find and print all the words whose first 3 letters are the same. Example: we input some paragraph and as output we get words like:
a) 1. you 2. your 3. yours 4. yourself
b) 1. early 2. earlier 3. earliest
In this way we get all the words of the paragraph that have their first 3 letters in common."
A reasonable solution that's not too hard to code up is to maintain a map of some sort where the keys are the first three letters of each word and the values are the sets of words that start with those three letters. You can scan across the words in the paragraph and, for each one you encounter, take its first three letters, look up the map entry corresponding to those letters, and add that word to the set. You can then iterate over the map at the end, find all sets containing at least two words, then print out each cluster you find.
Overall, the runtime of this approach is O(L), where L is the total length of all the words in the paragraph. To see this, notice that for each word, we do a map lookup on a constant-sized prefix of that word, then copy all the characters of the word into the map. Overall, this visits each character at most a constant number of times.
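As a rough Python sketch of the map-based approach: the function name group_by_prefix, the choice to lowercase everything, and the decision to treat only runs of ASCII letters as words are assumptions of mine, not part of the original problem statement.

```python
import re
from collections import defaultdict


def group_by_prefix(paragraph, prefix_len=3):
    """Map each shared prefix to the sorted words that start with it."""
    groups = defaultdict(set)
    for word in re.findall(r"[A-Za-z]+", paragraph.lower()):
        if len(word) >= prefix_len:  # words shorter than the prefix are skipped
            groups[word[:prefix_len]].add(word)
    # Keep only prefixes shared by at least two distinct words.
    return {p: sorted(ws) for p, ws in groups.items() if len(ws) >= 2}


text = "You said your book is yours; early birds arrive earlier."
for prefix, words in group_by_prefix(text).items():
    print(prefix, words)
```

Using a set per prefix means a word that occurs several times in the paragraph is reported once.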
Trie with the first three characters and then the word index as the leaf should do the trick.

How will you sort strings in the following example?

So I have a list of strings:
{test,testertest,testing,tester,testingtest}
I want to sort it in descending order. How do you sort strings in general? Is it based on the length, or is it character by character?
How would it work in the example above? I want to sort them in descending order.
Nearly every language has a built-in sort function that orders strings lexicographically, which returns
['test','tester','testertest','testing','testingtest']
for your example. If I wanted this reversed, I would just say reversed(sorted(myList)) in Python (or sorted(myList, reverse=True) to get a list directly) and be done with it. If you look to your right you can see plenty of related questions that require a more specialized ordering method (for numbers, dates, etc.), but lexicographic order works on strings containing any kind of data.
Here’s how it works:
compare(string A, string B):
    if A and B are both non-empty:
        if A[0] == B[0]:
            // First letters are the same; compare by the rest
            return compare(A[1:], B[1:])
        else:
            // Compare the first letters by Unicode code point
            return compare(A[0], B[0])
    else:
        // They were equal till now; the shorter one shall be sorted first
        return compare(length of A, length of B)
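A runnable Python sketch of the same comparison is below. It is written as a loop rather than recursively, and the name compare is simply carried over from the pseudocode. In practice the built-in sorted() already orders strings this way, so the hand-written comparator is only for illustration.

```python
from functools import cmp_to_key


def compare(a, b):
    """Return a negative, zero, or positive number, like a classic comparator."""
    for x, y in zip(a, b):
        if x != y:
            # First differing position decides, by Unicode code point.
            return -1 if x < y else 1
    # One string is a prefix of the other: the shorter one sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


words = ["test", "testertest", "testing", "tester", "testingtest"]
ascending = sorted(words, key=cmp_to_key(compare))
descending = sorted(words, key=cmp_to_key(compare), reverse=True)
print(ascending)   # ['test', 'tester', 'testertest', 'testing', 'testingtest']
print(descending)  # ['testingtest', 'testing', 'testertest', 'tester', 'test']
```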
I would sort it like this:
testingtest
testing
testertest
tester
test
Assuming C#
string[] myStrings = {"test","testertest","testing","tester","testingtest"};
Array.Sort(myStrings);      // ascending
Array.Reverse(myStrings);   // now descending
foreach (string s in myStrings)
{
    Console.WriteLine(s);
}
Not always an ideal way to do it - you could implement a custom comparer instead - but for the trivial example you asked about this is probably the most logical approach.
In computer science, strings are usually sorted character by character, with a common sort order being (for a standard English character set):
Null characters first
Followed by whitespace
Followed by symbols
Followed by numeric characters in obvious numerical order
Followed by alphabetic characters in obvious alphabetical order
How case is handled depends on the comparison used: in a plain ordinal (code point) comparison uppercase letters come before lowercase ones, whereas many culture-aware collations put lowercase first.
So for example if we were to sort / compare:
test i ng
test e r
Then "tester" would come before "testing" - the first different character in the string is the 5th one, and "e" comes before "i".
Similarly, if we were to compare:
test
testing
Then in this case "test" would come first. Once again the strings are identical until the 5th character, where the string "test" ends (i.e. no character), which comes before any alphanumeric character.
Note that this can produce some counter-intuitive results when dealing with numbers. For example, try sorting the strings "50" and "100": you will find that "100" comes before "50". Why? Because the strings differ at character 1, and "5" comes after "1".
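A short Python illustration of that pitfall, using made-up sample values; passing key=int is one common way to sort numeric strings by value and assumes every string really is an integer.

```python
values = ["50", "100", "9"]

as_text = sorted(values)              # compared character by character
as_numbers = sorted(values, key=int)  # compared by numeric value

print(as_text)     # ['100', '50', '9']
print(as_numbers)  # ['9', '50', '100']
```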
In nearly all languages there is a function which will do all of the above for you!
You should use that function instead of trying to sort strings yourself! For example:
// C#
string[] myStrings = {"test","testertest","testing","tester","testingtest"};
Array.Sort(myStrings);
In Java you can use natural ordering with
java.util.Collections.sort(list);
then make it descending with
java.util.Collections.reverse(list);
or create your own Comparator to do the reverse sorting.
When comparing two strings to see which sorts first, the comparison is typically done on a character by character basis. If the characters in the first position (e.g., t in your example) are identical, you move to the next character. When two characters differ, that "may" define which string is considered "greater".
However, depending on the locale used and a number of other factors, it is possible for later characters in the two strings being compared to override a difference in an earlier character. For example, in some collations, the diacritics on letters are considered to be of secondary weight. So a primary difference in a later character can override the secondary difference.
When two strings are otherwise identical but one is longer, the longer one is typically considered to be "greater". When sorting in descending order, the "greater" of two strings is sorted first.
Do you want to know if test should appear after tester in a descending order? Or are you particularly interested in sorting strings with similar prefixes?
If it's the latter, I'd suggest a Trie if the input tends to grow very large.
