Alternatives to String.Compare for performance

I've used a profiler on my C# application and realised that String.Compare() is taking a lot of time overall: 43% of overall time with 124M hits.
I'm comparing relatively small strings: from 4 to 50 chars.
What would you recommend replacing it with in terms of performance?
UPD: I only need to decide if 2 strings are the same or not. Strings can be zero or "". No cultural aspect or any other aspect to it. Most of the time it'll be "4578D" compared to "1235E" or similar.
Many thanks in advance!

It depends on what sort of comparison you want to make. If you only care about equality, then use one of the Equals overloads - for example, it's quicker to find that two strings have different lengths than to compare their contents.
If you're happy with an ordinal comparison, explicitly specify that:
int result = string.CompareOrdinal(x, y);
An ordinal comparison can be much faster than a culture-sensitive one.
Of course, that assumes that an ordinal comparison gives you the result you want - correctness is usually more important than performance (although not always).
EDIT: Okay, so you only want to test for equality. I'd just use the == operator, which uses an ordinal equality comparison.

You can use different ways of comparing strings.
String.Compare(str1, str2, StringComparison.CurrentCulture) // default
String.Compare(str1, str2, StringComparison.Ordinal) // fastest
Making an ordinal comparison can be something like twice as fast as a culture-dependent comparison.
If you make a comparison for equality, and the strings don't contain any culture-dependent characters, you can very well use an ordinal comparison.

Related

Are there example Strings for unittests? (e.g. with control characters)

I have to write several unit tests for a piece of our software. On many occasions I came across strings which are set internally, set by the user, or even persisted (e.g. in a database, XML, etc.).
Does anyone know of a character set or string(s) for simple tests of whether the input is equal to the output?
I know how to test; my question is about what to test.
For numbers you would, for example, test values around 0, negative and positive numbers, or, if they are parsed from a string, both floating-point and integer values.
But for Text/String I'm not aware of standard sequences to test.

Here is the Big List of Naughty Strings - sounds like it might be what you need!
The Big List of Naughty Strings is an evolving list of strings which have a high probability of causing issues when used as user-input data.

How could you sort string words in low level?

Of course there are handy library functions in all kinds of languages for sorting strings. However, I am interested in the low-level details of string sorting. My naive belief is that you use the ASCII values of the strings to turn the problem into numerical sorting. However, if the words are longer than a single character, things get a little complicated for me. What is the state-of-the-art approach for sorting multi-character strings?

Strings are typically just sorted with a comparison-based sorting algorithm, such as quick-sort or merge-sort (I know of a few libraries that do this, and I'd assume most would, although there can certainly be exceptions).
But you could indeed convert your string to a numeric value and use a distribution sort, such as counting-, bucket- or radix-sort, instead.
But there's no silver bullet here - the best solution will largely depend on the scenario - it's really just something you have to benchmark with the sorting implementations you're using, on the system you're working, with your typical data.
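As a rough illustration of the two families mentioned above, here is a minimal Python sketch of a distribution sort for strings, checked against the built-in comparison-based sort. The function name lsd_radix_sort and the restriction to ASCII input (code points below 128) are assumptions made for this example, not something from the answer.

```python
def lsd_radix_sort(words):
    """Sort ASCII strings with a least-significant-digit radix sort.

    Assumption for this sketch: every character has a code point below 128.
    A missing character (string shorter than the current position) gets
    key 0, so it sorts before every real character.
    """
    if not words:
        return []
    width = max(len(w) for w in words)
    result = list(words)
    # Walk the character positions from last to first; each pass is a
    # stable bucket sort on a single position.
    for pos in range(width - 1, -1, -1):
        buckets = [[] for _ in range(129)]  # 0 = no character, 1..128 = ASCII
        for w in result:
            key = ord(w[pos]) + 1 if pos < len(w) else 0
            buckets[key].append(w)
        result = [w for bucket in buckets for w in bucket]
    return result


words = ["pear", "apple", "fig", "app", "banana"]
# The radix sort should agree with the comparison-based built-in sort.
print(lsd_radix_sort(words) == sorted(words))
```

Whether something like this beats the library sort depends on string lengths and data, which is why benchmarking on your own data matters.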

Naive sorting of ASCII strings is naive because it basically treats the strings as numbers written in base-128 or base-256. Dictionaries are meant for human usage and sort strings according to more complex criteria.
A pretty elaborate example is the 'Unicode Technical Standard #10' - Unicode Collation Algorithm.
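To see what "treating strings as numbers" means in practice, here is a small Python sketch. The word list is made up for the example, and the case-insensitive key is only a simplistic stand-in; it is not the Unicode Collation Algorithm.

```python
# "\u00e9" is the precomposed character e-acute.
words = ["apple", "Zebra", "\u00e9clair", "banana"]

# Plain sorted() compares code points: "Z" (90) < "a" (97) < e-acute (233),
# so "Zebra" comes first and the accented word comes last.
naive = sorted(words)

# Ignoring case fixes "Zebra", but the accented word still ends up after
# everything else, which a dictionary would not do.
caseless = sorted(words, key=str.casefold)

print(naive)
print(caseless)
```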

Fastest way to determine if a string contains a character

I have a string which consists of unicode characters. The same character can occur only once.
The length of the string is between 1 and ~50.
What is the fastest way to check if a particular character is in the string or not?
Iterating over the string is not a good choice, is it? Is there an efficient algorithm for this purpose?
My first idea was to keep the characters in the string alphabetically sorted. It could be searched quickly, but sorting and comparing unicode characters is not so trivial (using the right collation), and it has a big cost, probably bigger than iterating over the whole string.
Maybe some hashing? Maybe the iteration is the fastest way?
Any idea?

If there's no preprocessing, the simplest and fastest way is to iterate through the characters.
If there's preprocessing, the previous approach might still be the best, or you could try a small hashtable which stores whether a string contains that character. Storing the hash will take extra space, but could be better for the memory cache (with few hash collisions, and assuming you don't have to access the actual string). Make sure you measure the performance.
I have a feeling you're trying to over-engineer a really simple task. Have you verified that this is a bottleneck in your application?
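The two options can be sketched in Python as follows. The function names and the sample string are hypothetical, and a frozenset stands in for the "small hashtable"; which one is faster for strings of up to ~50 characters is something to measure rather than assume.

```python
def contains_linear(text, ch):
    """No preprocessing: walk the string until the character is found."""
    for c in text:
        if c == ch:
            return True
    return False


def build_lookup(text):
    """Preprocessing: a hash-based set of the characters in the string."""
    return frozenset(text)


text = "qwertzuiop"
lookup = build_lookup(text)  # built once, reused for many queries

print(contains_linear(text, "z"), "z" in lookup)
print(contains_linear(text, "a"), "a" in lookup)
```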

A linear search through the string is O(n) with each operation being very simple. Sorting the string is O(n log n) with more complicated operations. It's pretty clear that the linear search will be faster in all cases.
If the characters are stored in UTF-8 or UTF-16 encoding then there's a possibility that you'll need to search for more than one contiguous element. There are ways to speed that up, such as Boyer-Moore or Knuth-Morris-Pratt. It's unclear whether there would be an actual speedup with such short search strings.

Is it a repeated operation on the same string or a one-time task? If it is a one-time task, then you can't do better than going through the string; after all, you have to look at all the characters, which is O(n).
If it is a repeated operation, you can preprocess the strings to make subsequent lookups faster. The most space-efficient and fastest option would be to build a Bloom filter for the characters in each string. Once built, which is fast too, it can tell you in O(1) that a character is not present, and you only need a binary search of the sorted string when the Bloom filter says yes.
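A minimal Python sketch of that idea, under a few assumptions of mine: the class name CharIndex is hypothetical, the filter is a single 64-bit mask with two very cheap hash functions, and the characters are sorted by code point (not by collation), which is enough for a membership test.

```python
from bisect import bisect_left


class CharIndex:
    """Tiny Bloom-style filter in front of a binary search."""

    def __init__(self, text):
        # Code point order is sufficient here; no collation is needed.
        self.sorted_chars = sorted(text)
        self.mask = 0
        for c in text:
            self.mask |= self._bits(c)

    @staticmethod
    def _bits(c):
        cp = ord(c)
        # Two cheap "hash functions", each selecting one of 64 bits.
        return (1 << (cp % 64)) | (1 << ((cp * 31 + 7) % 64))

    def __contains__(self, c):
        bits = self._bits(c)
        if (self.mask & bits) != bits:
            return False  # the filter says "definitely not present"
        # The filter says "maybe": confirm with a binary search.
        i = bisect_left(self.sorted_chars, c)
        return i < len(self.sorted_chars) and self.sorted_chars[i] == c


index = CharIndex("hello w\u00f6rld")
print("h" in index, "\u00f6" in index, "z" in index)
```

For strings this short the filter may not pay for itself, so it is worth timing against a plain scan.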

NSString compare efficiency

String compares can be costly. There's some statistic floating around that says a very high percent of string compares can be eliminated by first comparing string sizes. So I'm curious to know whether the NSString compare: method takes this into consideration. Anyone know?

According to the sources here (which is just one implementation, others may act differently), compare doesn't check the length first, which actually makes sense since it's not an equality check. As it returns a less-than/equal-to/greater-than return code, it has to check the characters even if the lengths are different.
A pure isEqual-type method may be able to shortcut character checks if the lengths are different, but compare does not have that luxury.
It does do certain checks of the length against zero, but not comparisons of the two lengths against each other.
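The difference between the two kinds of method can be sketched in Python. This is a language-neutral illustration with hypothetical function names, not the NSString implementation.

```python
def three_way_compare(a, b):
    """Return -1, 0 or 1. The order is decided by the first differing
    character, so comparing lengths up front would not help."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # One string is a prefix of the other: only now do the lengths decide.
    return (len(a) > len(b)) - (len(a) < len(b))


def is_equal(a, b):
    """Equality only: identity and length checks can skip the scan."""
    if a is b:
        return True
    if len(a) != len(b):
        return False
    return all(x == y for x, y in zip(a, b))


print(three_way_compare("b", "aaa"))  # the shorter string sorts later here
print(is_equal("abc", "abcd"))        # rejected by the length check alone
```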

Yes it does. It also checks for pointer equality before that (which covers the constant string case and some others due to string uniquing and the string ROM).
(edit) This answer applies to -isEqualToString:, not -compare:. I misread the question.

Comparing long strings by their hashes

Trying to improve the performance of a function that compares strings I decided to compare them by comparing their hashes.
So is there a guarantee that if the hashes of 2 very long strings are equal, then the strings are also equal?

While it's guaranteed that 2 identical strings will give you equal hashes, the other way round is not true: for a given hash, there are always several possible strings which produce the same hash.
This is true due to the pigeonhole principle.
That being said, the chances of 2 different strings producing the same hash can be made infinitesimal, to the point of being considered equivalent to zero.
A fairly classic example of such a hash is MD5, which has a near-perfect distribution over 128 bits. That means there is roughly one chance in 2^128 that 2 different strings, not deliberately crafted to collide, produce the same hash. Well, basically, almost the same as impossible.
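The pigeonhole argument is easy to see with a deliberately tiny hash. The sketch below uses a made-up 8-bit hash (toy_hash is hypothetical and far weaker than MD5) so that a collision turns up after at most 257 distinct inputs; with 128 bits the same argument holds, but collisions between ordinary inputs become vanishingly rare.

```python
def toy_hash(s):
    """Deliberately weak 8-bit hash: only 256 possible values."""
    h = 0
    for b in s.encode("utf-8"):
        h = (h * 31 + b) % 256
    return h


def find_collision(candidates):
    """Return the first pair of different strings with the same toy hash."""
    seen = {}
    for s in candidates:
        h = toy_hash(s)
        if h in seen and seen[h] != s:
            return seen[h], s
        seen[h] = s
    return None


# 257 distinct strings, 256 possible hash values: a collision must exist.
pair = find_collision(f"key{i}" for i in range(257))
print(pair)
```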

In the simple common case where two long strings are to be compared to determine if they are identical or not, a simple compare would be much preferred over a hash, for two reasons. First, as pointed out by @wildplasser, the hash requires that all bytes of both strings be traversed in order to calculate the two hash values, whereas the simple compare is fast and only needs to traverse bytes until the first difference is found, which may be much less than the full string length. And second, a simple compare is guaranteed to detect any difference, whereas the hash gives only a high probability that they are identical, as pointed out by @AdamLiss and @Cyan.
There are, however, several interesting cases where the hash comparison can be employed to great advantage. As mentioned by @Cyan, if the compare is to be done more than once, or must be stored for later use, then the hash may be faster. A case not mentioned by others is when the strings are on different machines connected via a local network or the Internet. Passing a small amount of data between the two machines will generally be much faster than sending the whole string. The simplest first check is to compare the sizes of the two; if they differ, you're done. Otherwise, compute the hash, each on its own machine (assuming you are able to create the process on the remote machine), and again, if they differ, you are done. If the hash values are the same, and if you must have absolute certainty, there is no easy shortcut to that certainty. Using lossless compression on both ends will allow less data to be transferred for comparison. And finally, if the two strings are separated by time, as alluded to by @Cyan, for example if you want to know whether a file has changed since yesterday and you have stored the hash from yesterday's version, then you can compare today's hash to it.
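The size-then-hash check described above might look roughly like this in Python. The function names are hypothetical, SHA-256 is my choice of hash for the sketch (the answer does not name one), and the actual network transfer is left out: each side would compute its own fingerprint and only that small tuple would be exchanged or stored.

```python
import hashlib


def fingerprint(data: bytes):
    """Small summary of the data: (size in bytes, SHA-256 hex digest)."""
    return len(data), hashlib.sha256(data).hexdigest()


def same_content(fp_a, fp_b):
    """False means definitely different. True means identical with very
    high probability, not with absolute certainty."""
    size_a, digest_a = fp_a
    size_b, digest_b = fp_b
    if size_a != size_b:
        return False  # cheapest check first
    return digest_a == digest_b


yesterday = fingerprint(b"x" * 100_000)              # stored earlier
today_same = fingerprint(b"x" * 100_000)
today_changed = fingerprint(b"x" * 99_999 + b"y")

print(same_content(yesterday, today_same))
print(same_content(yesterday, today_changed))
```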
I hope this will help stimulate some "out of the box" ideas for someone.

I am not sure your performance will be improved. Both approaches, building the hashes and comparing integers, or simply comparing the strings using equals, have the same complexity, O(n), where n is the number of characters.
