String compares can be costly. There's some statistic floating around that says a very high percent of string compares can be eliminated by first comparing string sizes. So I'm curious to know whether the NSString compare: method takes this into consideration. Anyone know?
According to the sources here (which is just one implementation; others may act differently), compare doesn't check the length first, which actually makes sense since it's not an equality check. Because it returns a less-than/equal-to/greater-than result, it has to examine the characters anyway, even when the lengths differ, so comparing lengths up front wouldn't let it skip any work.
A pure isEqual-type method may be able to shortcut character checks if the lengths are different, but compare does not have that luxury.
It does do certain checks of the length against zero, but not comparisons of the two lengths against each other.
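Not the actual NSString source, but a small sketch (in Python, purely for illustration) of the distinction the answer above draws: an equality check can bail out on a length mismatch, while a three-way compare cannot decide the order from lengths alone.

def is_equal(a, b):
    # An equality check can short-circuit: different lengths mean the
    # strings cannot be equal, so no character comparison is needed.
    if len(a) != len(b):
        return False
    return all(x == y for x, y in zip(a, b))

def three_way_compare(a, b):
    # A compare that must return -1/0/+1 cannot decide the order from the
    # lengths alone: "ab" sorts before "b" even though it is longer, so the
    # characters have to be walked until the first difference.
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # All shared positions are equal: the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))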
Yes it does. It also checks for pointer equality before that (which covers the constant string case and some others due to string uniquing and the string ROM).
(edit) This answer applies to -isEqualToString:, not -compare:. I misread.
There is a coding puzzle I have encountered on one of those sites (I don't recall if it was leetcode or something else) which goes as follows: Given a list of strings, return the lexicographically smallest concatenation that uses each of the strings once. The solution is, on a technical level, fairly simple. You compare 2 strings a and b by checking whether ab<ba (lexicographically), sort the list, concatenate everything.
Now for the actual question: Does this ordering have a name? I tried googling around but never found anything.
There is also a secondary aspect to this, which is: Is it somehow immediately obvious that this is even a strict weak ordering? I certainly didn't think it was. Here is the proof that I came up with to convince myself that it is one:
For any given string s let |s| be its length and let s^n be s repeated n times.
If ab<ba then a^|b|b^|a|<b^|a|a^|b| (to see this, successively swap neighboring ab pairs to get a lexicographically increasing sequence of strings that ends in b^|a|a^|b|). It follows that a^|b|<b^|a|: both concatenations have the same length, their first halves are a^|b| and b^|a| (each of length |a||b|), and if those halves were equal the two concatenations would be identical; since they are not, the comparison is decided within the first |a||b| characters, so a^|b|<b^|a|. The same argument works for > and =, so we have proven that ab<ba is actually equivalent to a^|b|<b^|a|, with the latter clearly defining a strict weak ordering.
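For concreteness, a minimal sketch of that comparator-based sort (Python; functools.cmp_to_key turns the ab<ba test into a sort key, and the function name is mine, not from any particular site's reference solution).

from functools import cmp_to_key

def smallest_concatenation(strings):
    # Put a before b whenever a + b is lexicographically smaller than b + a.
    def cmp(a, b):
        if a + b < b + a:
            return -1
        if a + b > b + a:
            return 1
        return 0
    return "".join(sorted(strings, key=cmp_to_key(cmp)))

# smallest_concatenation(["b", "ba"]) == "bab", not "bba"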
Leetcode Description
Given a string of length n, you have to decide whether the string is a palindrome or not, but you may delete at most one character.
For example: "aba" is a palindrome.
"abca" is also valid, since we may remove either "b" or "c" to get a palindrome.
I have seen many solutions that take the following approach.
With two pointers left and right initialized to the first and last characters of the string, respectively, keep incrementing left and decrementing right synchronously as long as the two characters pointed to by left and right are equal.
The first time we run into a mismatch between the characters pointed to by left and right, say at indices i and j specifically, we simply check whether string[i..j-1] or string[i+1..j] is a palindrome.
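For concreteness, here is a small sketch of that two-pointer approach (Python; the names are mine, not any particular reference solution).

def valid_palindrome_with_one_deletion(s):
    def is_palindrome(i, j):
        # Check s[i..j] inclusive with two pointers.
        while i < j:
            if s[i] != s[j]:
                return False
            i += 1
            j -= 1
        return True

    left, right = 0, len(s) - 1
    while left < right:
        if s[left] != s[right]:
            # First mismatch at indices i = left, j = right:
            # try deleting either s[i] or s[j].
            return is_palindrome(left + 1, right) or is_palindrome(left, right - 1)
        left += 1
        right -= 1
    return True  # already a palindrome, nothing needs deleting

# valid_palindrome_with_one_deletion("abca") -> True (delete "b" or "c")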
I clearly see why this works, but one thing that's bothering me is the direction of the approach that we take when we first see the mismatch.
Assuming we do not care about time efficiency and only focus on correctness, I cannot see what prevents us from deleting a character in string[0..i-1] or string[j+1..n-1] instead and checking whether the entire resulting string becomes a palindrome.
More specifically, if we take the first approach and see that neither string[i..j-1] nor string[i+1..j] is a palindrome, what prevents us from backtracking to the second approach I described and seeing if deleting a character from string[0..i-1] or string[j+1..n-1] will yield a palindrome instead?
Can we mathematically prove why this approach is useless or simply incorrect?
Trying to improve the performance of a function that compares strings, I decided to compare them by comparing their hashes.
So is there a guarantee that if the hashes of 2 very long strings are equal to each other, then the strings are also equal to each other?
While it's guaranteed that 2 identical strings will give you equal hashes, the other way round is not true: for a given hash, there are always several possible strings which produce the same hash.
This follows from the pigeonhole principle.
That being said, the chances of 2 different strings producing the same hash can be made infinitesimal, to the point of being considered effectively zero.
A fairly classical example of such a hash is MD5, which has a near-perfect 128-bit distribution. This means you have roughly one chance in 2^128 that 2 different strings produce the same hash, which is, for all practical purposes, impossible.
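For illustration, computing and comparing MD5 digests in Python with hashlib (treating equal digests as "almost certainly equal", per the odds above):

import hashlib

def md5_digest(s):
    return hashlib.md5(s.encode("utf-8")).digest()

a = "first long string"
b = "second long string"
if md5_digest(a) != md5_digest(b):
    print("definitely different")    # different digests imply different strings
else:
    print("almost certainly equal")  # equal digests imply equality with overwhelming probability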
In the simple common case where two long strings are to be compared to determine whether they are identical, a simple compare would be much preferred over a hash, for two reasons. First, as pointed out by @wildplasser, the hash requires that all bytes of both strings be traversed in order to calculate the two hash values, whereas the simple compare only needs to traverse bytes until the first difference is found, which may be much less than the full string length. And second, a simple compare is guaranteed to detect any difference, whereas the hash gives only a high probability that the strings are identical, as pointed out by @AdamLiss and @Cyan.
There are, however, several interesting cases where the hash comparison can be employed to great advantage. As mentioned by @Cyan, if the compare is to be done more than once, or the result must be stored for later use, then the hash may be faster. A case not mentioned by others is when the strings are on different machines connected via a local network or the Internet: passing a small amount of data between the two machines will generally be much faster. The simplest first check is to compare the sizes; if they differ, you're done. Otherwise, compute the hash, each on its own machine (assuming you are able to create the process on the remote machine); again, if they differ, you're done. If the hash values are the same and you must have absolute certainty, there is no easy shortcut to that certainty; using lossless compression on both ends will at least allow less data to be transferred for the full comparison. And finally, if the two strings are separated by time, as alluded to by @Cyan, e.g. if you want to know whether a file has changed since yesterday and you have stored the hash of yesterday's version, you can simply compare today's hash to it.
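A rough local sketch of that staged check (Python; the networking part is omitted, and SHA-256 is just one possible hash choice; the point is only the order of the checks):

import hashlib

def staged_equal(a, b):
    # 1. Cheapest check first: different lengths mean different strings.
    if len(a) != len(b):
        return False
    # 2. Hashes: in the remote scenario each machine would compute its own
    #    digest and only the small digest would cross the network.
    if hashlib.sha256(a.encode()).digest() != hashlib.sha256(b.encode()).digest():
        return False
    # 3. Only if absolute certainty is required, fall back to a full compare.
    return a == b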
I hope this will help stimulate some "out of the box" ideas for someone.
I am not sure your performance will be improved. Both building a hash and comparing the integers, and simply comparing the strings with equals, have the same complexity, which lies in O(n), where n is the number of characters.
I've used a profiler on my C# application and realised that String.Compare() is taking a lot of time: 43% of overall time with 124M hits.
I'm comparing relatively small strings: from 4 to 50 chars.
What would you recommend replacing it with in terms of performance?
UPD: I only need to decide if 2 strings are the same or not. Strings can be null or empty (""). No cultural aspect or any other aspect to it. Most of the time it'll be "4578D" compared to "1235E" or similar.
Many thanks in advance!
It depends on what sort of comparison you want to make. If you only care about equality, then use one of the Equals overloads - for example, it's quicker to find that two strings have different lengths than to compare their contents.
If you're happy with an ordinal comparison, explicitly specify that:
int result = string.CompareOrdinal(x, y);
An ordinal comparison can be much faster than a culture-sensitive one.
Of course, that assumes that an ordinal comparison gives you the result you want - correctness is usually more important than performance (although not always).
EDIT: Okay, so you only want to test for equality. I'd just use the == operator, which uses an ordinal equality comparison.
You can use different ways of comparing strings.
String.Compare(str1, str2, StringComparison.CurrentCulture) // default
String.Compare(str1, str2, StringComparison.Ordinal) // fastest
Making an ordinal comparison can be something like twice as fast as a culture-dependent comparison.
If you are comparing for equality, and the strings don't contain any culture-dependent characters, you can very well use an ordinal comparison.
I want to map strings (words) to numbers: the more similar two strings are, the closer their mapped values should be. The positional combination of the letters should also affect the mapping; that is, the mapping function should be a function of the letters, their positions (so that, given the position of each letter, words such as pit and tip map to different values), and the number of letters.
Well, I'll give some examples: starter, stater, stapler, startler, tstarter are some words. These words are of the format "(*optional)sta(*opt)*er", where * denotes some sort of variable part, in our case either 't' or 'l' (as in starter and stapler). These should all be mapped individually, without reference to the others, such that their values do not differ by much; later on, when creating groups, I can assign an appropriate range of numbers to each group to differentiate the groups.
So when the strings are mapped, their values should be similar. There are many words, so comparing each one with every other would be complex; instead I want to map each word to some numeric value independently, put the similar strings (since they have similar values) in a group, and then later find these patterns by other means.
So, for now, I need to look for some existing methods of mapping such that similar strings (I think I have clarified the term 'similar' for my context) have similar values, and these values should be different from those of dissimilar strings. Again, I emphasize that the number of strings is huge and comparing each with every other is practically impossible (or at least computationally expensive and very slow). So what I am thinking of is to devise an algorithm (taking help from existing ones) for mapping each word (string) on its own.
Have I made myself clear? Please give me some ideas to start with, some terms to search for and research.
I think I need some type of "bad" hash function to hash the strings and then put them in buckets according to that hash value. At least give me some ideas or algorithm names.
Seems like it would be best to use a known algorithm like Levenshtein distance.
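In case it helps, a standard dynamic-programming Levenshtein distance in Python (nothing specific to this particular problem):

def levenshtein(a, b):
    # prev[j] holds the edit distance between a[:i-1] and b[:j]; curr builds row i.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# levenshtein("pit", "tip") == 2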
This search on StackOverflow reveals this question about finding-groups-of-similar-strings-in-a-large-set-of-strings, which links to this article describing a SimHash, which sounds exactly like what you want.
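A very small SimHash-style sketch in Python; the choice of overlapping character bigrams as features and the 64-bit fingerprint size are my assumptions here, not taken from the linked article:

import hashlib

def simhash(word, bits=64):
    # Features: overlapping character bigrams, which keep some positional
    # information so that "pit" and "tip" produce different fingerprints.
    features = [word[i:i + 2] for i in range(len(word) - 1)] or [word]
    weights = [0] * bits
    for f in features:
        h = int.from_bytes(hashlib.md5(f.encode()).digest()[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    # Each bit of the fingerprint is the sign of the accumulated weight.
    return sum(1 << b for b in range(bits) if weights[b] > 0)

def hamming(x, y):
    # Similar words tend to have fingerprints with a small Hamming distance.
    return bin(x ^ y).count("1")

# hamming(simhash("starter"), simhash("stater")) is typically much smaller
# than hamming(simhash("starter"), simhash("zebra")), so fingerprints can be
# bucketed without comparing every word against every other word.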