Are there example strings for unit tests? (e.g. with control characters)

I have to write several unit tests for a piece of our software. On many occasions I have come across strings that are set internally, set by the user, or even persisted (e.g. in a database, XML, etc.).
Does anyone know of a character set or a set of strings for simple tests of whether the input is equal to the output?
I know how to test; my question is about what to test.
For numbers you would, for example, test values around 0, negative and positive numbers or, if they are retrieved from a string, both floating-point and integer numbers.
But for text/strings I'm not aware of any standard sequences to test.

Here is the Big List of Naughty Strings - sounds like it might be what you need!
The Big List of Naughty Strings is an evolving list of strings which have a high probability of causing issues when used as user-input data.
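As a rough illustration of the kind of input-equals-output test the question asks about, here is a minimal Python sketch. The sample strings are my own illustrative picks (empty, whitespace, control characters, quoting characters, non-ASCII text), not entries copied from the Big List of Naughty Strings, and a JSON round trip stands in for whatever persistence layer (database, XML, ...) you actually use. The names SAMPLE_STRINGS and round_trips are hypothetical.

```python
import json

# Illustrative picks only; NOT taken from the Big List of Naughty Strings.
SAMPLE_STRINGS = [
    "",                  # empty string
    " ",                 # a single space
    "\t\r\n",            # whitespace control characters
    "\x00\x07\x1b",      # NUL, BEL, ESC
    "'\"\\",             # quotes and a backslash
    "<b>&amp;</b>",      # markup-like text
    "äöü ß é",           # Latin letters with diacritics
    "日本語",             # CJK text
    "😀",                # a character outside the Basic Multilingual Plane
    "e\u0301",           # 'e' followed by a combining acute accent
    "a" * 10_000,        # a long string
]


def round_trips(value, store=json.dumps, load=json.loads):
    """Return True if value survives a store/load cycle unchanged.

    store/load default to JSON as a stand-in (an assumption); swap in
    the real persistence code under test.
    """
    return load(store(value)) == value


# Collect every sample that does not come back identical.
failures = [s for s in SAMPLE_STRINGS if not round_trips(s)]
```

In a real test suite each string would typically become its own parametrized test case, so that a failure names the offending input.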

Related

Collections: How will you find the top 10 longest strings in a list of a billion strings?

I was recently asked a question in an interview. How will you find the top 10 longest strings in a list of a billion strings?
My answer was that we need to write a Comparator that compares the lengths of two strings and then use the TreeSet(Comparator) constructor.
Once you start adding the strings to the TreeSet, it will sort them according to the order defined by the comparator.
Then just pop the top 10 elements of the TreeSet.
The interviewer wasn't happy with that. The argument was that, to hold a billion strings, I would have to use a supercomputer.
Is there any other data structure that can deal with this kind of data?
Given what you stated about the interviewer saying you would need a super computer, I am going to assume that the strings would come in a stream one string at a time.
Given the immense size due to no knowledge of how large the individual strings are (they could be whole books), I would read them in one at a time from the stream. I would then compare the current string to an ordered list of the top ten longest strings found before it and place it accordingly in the ordered list. I would then remove the smallest length one from the list and proceed to read the next string. That would mean only 11 strings were being stored at one time, the current top 10 and the one currently being processed.
Most languages have a built-in sort that is pretty speedy.
stringList.sort(key=len, reverse=True)
in Python would work (reverse=True is needed so that the longest strings come first). Then just grab the first 10 elements.
Also, your interviewer does sound behind the times. One billion strings is pretty small nowadays.
I remember studying a similar data structure for such scenarios, called a trie.
The height of the tree will always give the length of the longest string.
A special kind of trie, called a suffix tree, can be used to index all suffixes in a text in order to carry out fast full text searches.
The point is you do not need to STORE all strings.
Let's think about a simplified version: find the 2 longest strings (assuming no ties).
You can always use an online algorithm with 2 variables, s1 and s2, where s1 is the longest string encountered so far and s2 is the second longest.
Then you read the strings one by one in O(N), replacing s1 or s2 whenever the new string is longer. This takes O(2N) = O(N) comparisons.
For the top 10 strings, it is as simple as the top 2 case. You can still do it in O(10N) = O(N) and store only 10 strings.
There is a faster way, described below, but for a given constant like 2 or 10 you may not need it.
For top-K strings in general, you can use a structure like set in C++ (ordered so that longer strings have higher priority) to store the top-K strings. When a new string comes in, you simply insert it and remove the last one; both operations take O(lg K). So in total you can do it in O(N lg K) time with O(K) space.
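The O(N lg K) idea can be sketched in Python as follows. Note that this swaps the ordered set for a bounded min-heap (heapq), which gives the same bounds; the function name longest_strings and the tie-breaking by arrival order are my own assumptions, not part of the answer above.

```python
import heapq
from typing import Iterable, List


def longest_strings(stream: Iterable[str], k: int = 10) -> List[str]:
    """Return the k longest strings from stream, longest first.

    Only k strings are held in memory at any time, so the input can be
    a generator reading from a file or socket.
    """
    if k <= 0:
        return []
    heap = []  # min-heap of (length, sequence number, string)
    for seq, s in enumerate(stream):
        entry = (len(s), seq, s)
        if len(heap) < k:
            heapq.heappush(heap, entry)       # O(lg K)
        elif entry[0] > heap[0][0]:
            # Longer than the shortest string kept so far: swap it in.
            heapq.heapreplace(heap, entry)    # O(lg K)
    # Sort the survivors so the longest string comes first.
    return [s for _, _, s in sorted(heap, reverse=True)]
```

For example, longest_strings(open("strings.txt"), 10) would process a (hypothetical) file line by line without ever loading it whole. The sequence number only keeps heap entries comparable when lengths tie; a string that merely equals the current shortest length does not displace it.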

Important algorithm involving random access to a string?

I am implementing a different string representation in which accessing a string in a non-sequential manner is very costly. To avoid this cost, I am trying to implement position caches or character blocks so that one can jump to certain locations and scan from there.
In order to do so, I need a list of algorithms where scanning a string from right to left or random access of its characters is required, so I have a set of test cases to do some actual benchmarking and to create a model I can use to find a local/global optimum for my efforts.
Basically I know of:
String.charAt
String.lastIndexOf
String.endsWith
One scenario where one needs right to left access of strings is extracting the file extension and the file name (item) of paths.
For random access I can find no algorithm at all, unless one has prefix tables and accesses the string more randomly, checking all those positions for strings longer than the prefix.
Does anyone know of other algorithms where either right-to-left or random access to a string's characters is required?
[Update]
The hash code of a String is calculated from every character, accessed from left to right, with the running value stored in a local primitive variable. So this is not a case of random access.
The MD5 and CRC algorithms likewise process the complete string sequentially. So I cannot find any random-access examples at all.
One interesting algorithm is Boyer-Moore searching, which involves both skipping forward by a variable number of characters and comparing backwards. If those two operations are not O(1), then KMP searching becomes more attractive, but BM searching is much faster for long search patterns (except in rare cases where the search pattern contains lots of repetitions of its own prefix). For example, BM shines for patterns which must be matched at word-boundaries.
BM can be implemented for certain variable-length encodings. In particular, it works fine with UTF-8 because misaligned false positives are impossible. With a larger class of variable-length encodings, you might still be able to implement a variant of BM which allows forward skips.
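To make the two access patterns concrete, here is a minimal Python sketch of the Horspool simplification of Boyer-Moore (not the full algorithm: it uses only a bad-character table and omits the good-suffix rule). It shows both operations mentioned above: comparing backwards within a window and skipping forward by a variable number of characters. The function name horspool_find is hypothetical.

```python
def horspool_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Bad-character table: distance from the rightmost occurrence of each
    # character (excluding the last pattern position) to the pattern's end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos + m <= n:
        # Compare backwards, from the last pattern character to the first.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # Skip forward by a variable amount, decided by the text character
        # aligned with the end of the window (random access to the text).
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

Every text[pos + j] lookup here assumes O(1) indexing; with a representation where that is expensive, this inner loop is exactly the part worth benchmarking.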
There are a number of algorithms which require the ability to reset the string pointer to a previously encountered point; one example is word-wrapping an input to a specific line length. Those won't be impeded by your encoding provided your API allows for saving a copy of an iterator.

Alternatives to String.Compare for performance

I've used a profiler on my C# application and realised that String.Compare() is taking a lot of time overall: 43% of overall time with 124M hits
I'm comparing relatively small strings: from 4 to 50 chars.
What would you recommend replacing it with in terms of performance?
UPD: I only need to decide whether 2 strings are the same or not. Strings can be null or "". There is no cultural aspect or any other aspect to it. Most of the time it'll be "4578D" compared to "1235E" or similar.
Many thanks in advance!
It depends on what sort of comparison you want to make. If you only care about equality, then use one of the Equals overloads - for example, it's quicker to find that two strings have different lengths than to compare their contents.
If you're happy with an ordinal comparison, explicitly specify that:
int result = string.CompareOrdinal(x, y);
An ordinal comparison can be much faster than a culture-sensitive one.
Of course, that assumes that an ordinal comparison gives you the result you want - correctness is usually more important than performance (although not always).
EDIT: Okay, so you only want to test for equality. I'd just use the == operator, which uses an ordinal equality comparison.
You can use different ways of comparing strings.
String.Compare(str1, str2, StringComparison.CurrentCulture) // default
String.Compare(str1, str2, StringComparison.Ordinal) // fastest
Making an ordinal comparison can be something like twice as fast as a culture-dependent comparison.
If you are comparing for equality and the strings don't contain any culture-dependent characters, you can very well use an ordinal comparison.

VB6 - Is there any performance benefit gained by using fixed-width strings in VB6?

In pre-.NET Visual Basic, a programmer could declare a string to be a certain width. For example, I know that a social-security number (in the US) is always eleven characters. So, I can declare a string that would store social-security numbers as an eleven-character string like this:
Dim SSN As String * 11
My question is: does this create any type of performance benefit that would either make the code run faster or perhaps use less memory? Also, would a fixed-length string be allocated in memory differently (i.e.: on the stack as opposed to in the heap)?
No, there is no performance benefit.
But even if there were, unless you were calling the code many (say millions of) times in a loop, any performance benefit would be negligible.
Also, fixed-length strings occupy more memory than variable-length ones if you are not using the entire length (unless they are very short fixed-length strings).
As always, you should carefully benchmark before making the code harder to maintain.
Fixed-length strings were usually seen when interacting with some COM APIs, or when modelling domain constraints (such as the example you gave of an SSN).
The only time in VB6 or earlier that I had to use fixed-length strings was when working with API calls. Not passing a fixed-length string would cause unexplained errors at times when the length was longer than expected, and sometimes even when it was shorter than expected.
If you are going through and planning to change that in the application make sure there is no passing of the strings to an API or external DLL, and that the program does not require fixed length fields to be output, such as with many AS/400 import programs.
I personally never got to see a performance difference as I was running loops of 300k+ records, but had no choice but to provide and work with fixed lengths when I did. However VB likes to use undefined lengths by default so I would imagine the performance would be lower for fixed length.
Try writing a test app that performs a basic concatenation of two strings, and have it loop over the function about 50k times. Time the difference between the two versions, one using undefined-length strings and the other fixed-length strings.

Is there a formal definition of character difference across a string and if so how is it calculated?

Overview
I'm looking to analyse the difference between two characters as part of a password strength checking process.
I'll explain what I'm trying to achieve and why and would like to know if what I'm looking to do is formally defined and whether there are any recommended algorithms for achieving this.
What I'm looking to do
Across a whole string, I'm looking to compare the current character with the previous character and determine how different they are.
As this relates to password strength checking, the difference between one character and its predecessor in a string might be defined as how predictable character N is given character N - 1. There might be a formal definition for this of which I'm not aware.
Example
A password of abc123 could be arguably less secure than azu590. Both contain three letters followed by three numbers, however in the case of the former the sequence is more predictable.
I'm assuming that a password guesser might try some obvious sequences such that abc123 would be tried much before azu590.
Considering the decimal ASCII values for the characters in these strings, and given that b is 1 different from a and c is 1 different again from b, we could derive a simplistic difference calculation.
Ignoring cases where two consecutive characters are not in the same character class, we could say that abc123 has an overall character to character difference of 4 whereas azu590 has a similar difference of 25 + 5 + 4 + 9 = 43.
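The simplistic calculation described above could be sketched in Python like this. The helper names char_class and adjacent_difference, and the choice of four character classes (lowercase, uppercase, digit, other), are my own assumptions for illustration.

```python
def char_class(ch: str) -> str:
    """Classify a character; the four classes are an assumption."""
    if ch.islower():
        return "lower"
    if ch.isupper():
        return "upper"
    if ch.isdigit():
        return "digit"
    return "other"


def adjacent_difference(password: str) -> int:
    """Sum the code-point distances between consecutive characters,
    ignoring pairs whose characters are in different classes."""
    total = 0
    for prev, cur in zip(password, password[1:]):
        if char_class(prev) == char_class(cur):
            total += abs(ord(cur) - ord(prev))
    return total


print(adjacent_difference("abc123"))  # 1 + 1 + 1 + 1 = 4
print(adjacent_difference("azu590"))  # 25 + 5 + 4 + 9 = 43
```

Note that in this sketch all punctuation falls into one "other" class, so two adjacent symbols are still compared by code point.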
Does this exist?
This notion of character to character difference across a string might be defined, similar to the Levenshtein distance between two strings. I don't know if this concept is defined or what it might be called. Is it defined and if so what is it called?
My example approach to calculating the character to character difference across a string is a simple and obvious approach. It may be flawed, it may be ineffective. Are there any known algorithms for calculating this character to character difference effectively?
It sounds like you want a Markov Chain model for passwords. A Markov Chain has a number of states and a probability of transitioning between the states. In your case the states are the characters in the allowed character set and the probability of a transition is proportional to the frequency that those two letters appear consecutively. You can construct the Markov Chain by looking at the frequency of the transitions in an existing text, for example a freely available word list or password database.
It is also possible to use variations on this technique (Markov chain of order m) where you for example consider the previous two characters instead of just one.
Once you have created the model you can use the probability of generating the password from the model as a measure of its strength. This is the product of the probabilities of each state transition.
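A minimal first-order sketch of this in Python, under some loudly stated assumptions: the training corpus is a tiny made-up word list (a real model would need a large word list or password set), unseen transitions get add-one smoothing over an assumed alphabet of 95 printable ASCII characters, the probability of the first character is ignored, and log2 probabilities are summed instead of multiplying raw probabilities to avoid underflow. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_bigram_model(words):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for word in words:
        for prev, cur in zip(word, word[1:]):
            counts[prev][cur] += 1
    return counts


def log_probability(password, counts, alphabet_size=95):
    """Sum of log2 transition probabilities (add-one smoothed).

    Values closer to 0 mean the password is more predictable under the
    model; more negative values mean it is less predictable.
    """
    logp = 0.0
    for prev, cur in zip(password, password[1:]):
        row = counts.get(prev, {})
        total = sum(row.values())
        p = (row.get(cur, 0) + 1) / (total + alphabet_size)
        logp += math.log2(p)
    return logp


# Tiny illustrative corpus (an assumption, not real password data).
model = train_bigram_model(["abc", "abcd", "123", "abc123"])
predictable = log_probability("abc", model)
unpredictable = log_probability("azu", model)
```

With this toy corpus, "abc" scores higher (closer to 0) than "azu" because the transitions a-b and b-c were seen during training. Extending the sketch to an order-m chain would mean keying the counts on the previous m characters instead of one.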
For general signals/time-series data, this is known as Autocorrelation.
You could try adapting the Durbin–Watson statistic to test for positive autocorrelation between the characters. A naïve way may be to use the Unicode code points of each character, but I'm sure that will not be good enough.
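For reference, the naïve code-point variant could look like the following Python sketch. It treats the code points minus their mean as the residuals, which is my own assumption about how one might adapt the statistic; as the answer says, this is unlikely to be good enough in practice. The statistic is the sum of squared successive differences divided by the sum of squared residuals; values near 0 suggest positive autocorrelation, values near 2 suggest none, and values towards 4 suggest negative autocorrelation.

```python
def durbin_watson(password: str):
    """Naive Durbin-Watson statistic over Unicode code points.

    Returns None when the statistic is undefined (fewer than two
    characters, or all characters identical).
    """
    if len(password) < 2:
        return None
    codes = [ord(ch) for ch in password]
    mean = sum(codes) / len(codes)
    # Assumption: residuals are simply deviations from the mean code point.
    residuals = [c - mean for c in codes]
    denom = sum(e * e for e in residuals)
    if denom == 0:
        return None
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    return num / denom
```

A steadily increasing run such as "abcdefgh" yields a value well below 2, while an alternating string such as "azbycxdw" yields a value above 2.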

Resources