NUnit - how to compare strings containing composite Unicode characters?

I'm using NUnit v2.5 to compare strings that contain composite Unicode characters.
Although the comparison itself works fine, the caret indicating the first difference seems to be misplaced.
UPD: I've ended up with an overridden EqualConstraint that in turn invokes a custom TextMessageWriter, so I no longer need an answer. See the solution below.
Here's the snippet:
string s1 = "ใช้งานง่าย";
string s2 = "ใช้งานงาย";
Assert.That(s1, Is.EqualTo(s2));
Here's the output:
Expected: "ใช้งานงาย"
But was: "ใช้งานง่าย"
------------------^
The arrow indicating the first different character seems to be off by 2 positions (as many as there are tone marks above). For longer strings, it becomes a real pain.
I tried String.Normalize(), but it didn't help either.
How can I overcome this problem? Thanks for your help. See my answer below.

When comparing Unicode strings, you must always normalize both sides of the comparison, and in the same way. A binary compare of s1 and s2 is not good enough, because canonically equivalent strings do not necessarily test as binary equal.
Positing the existence of four trivial normalization functions, one for each of the four normalization forms, you would want to test NFD(s1) for binary equality with NFD(s2). It doesn't matter whether you use NFD or NFC there, but you must do the same thing to both strings.
The k-compat functions, NFKC and NFKD, are useful for string searching, because they improve recall at the cost of some precision. For example, NFKD("™") would be equal to NFKD("TM"). This is what Adobe Reader does, for example, when you run searches on documents: it always runs the search in k-compat mode, so that your searches have a better chance of finding things. However, unlike NFC and NFD, the k-compat functions NFKC and NFKD lose information and are not reversible. With simple NFD and NFC, though, you can always get back to the other one.
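As a rough illustration of the point above, here is a minimal sketch in Java, whose standard library exposes all four normalization forms through java.text.Normalizer. The class and method names (NormalizedCompare, canonicallyEqual, compatibilityEqual) are made up for this example; in .NET the equivalent call would be String.Normalize with the matching NormalizationForm.

```java
import java.text.Normalizer;

final class NormalizedCompare {
    /** Canonical equivalence: normalize BOTH sides to the same form (NFD here). */
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    /** Looser, lossy comparison suited to searching (compatibility decomposition). */
    static boolean compatibilityEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFKD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFKD));
    }

    public static void main(String[] args) {
        String precomposed = "\u00FC"; // u with diaeresis as one code point
        String decomposed = "u\u0308"; // u followed by a combining diaeresis

        System.out.println(precomposed.equals(decomposed));            // false
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true
        System.out.println(canonicallyEqual("\u2122", "TM"));          // false
        System.out.println(compatibilityEqual("\u2122", "TM"));        // true
    }
}
```

Note that this only addresses equality. It does not change where a test framework draws its caret.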

You should be able to use the code from this answer to convert each string to an escaped version of the original string. Composite characters will become a single escaped \u codepoint, while combining characters will be a series of such escapes. Then run your Assert on these escaped versions of the string.
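The linked code is not reproduced here, so the following is only a sketch of the idea in Java, with a hypothetical helper named UnicodeEscaper.escape: every UTF-16 code unit is rendered as a backslash-u escape, so a precomposed character yields one escape and a base letter plus combining mark yields two.

```java
final class UnicodeEscaper {
    /** Renders each UTF-16 code unit of s as a six-character backslash-u escape. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 6);
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("\u00FC"));  // one escape: precomposed character
        System.out.println(escape("u\u0308")); // two escapes: base letter + combining mark
    }
}
```

Because every escape has the same width, a caret computed from the escaped strings lines up regardless of how the original characters are rendered.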

I don't think I'll find a better answer, so I'm answering my own question.
Cause.
There are many languages using non-spacing modifiers for characters. For European languages, there are substitutions, e.g. "u" (U+0075) + combining diaeresis (U+0308) = "ü" (U+00FC). In this case, the solution by @tchrist is quite sufficient.
However, for complex writing systems, there are no precomposed substitutes for non-spacing modifiers, so normalization cannot collapse them. NUnit's TextMessageWriter.WriteCaretLine(int mismatch) treats the mismatch parameter as an offset in chars (UTF-16 code units), while the on-screen representation of a Thai string may be shorter than the caret line ("-----^").
Solution.
Force WriteCaretLine(int mismatch) to respect non-spacing modifiers by reducing the mismatch value by the number of non-spacing modifiers that occur before this offset.
Implement the supplementary classes that are needed only to get your new code invoked.
Along with Thai, I have tested it with Devanagari and Tibetan. It works as expected.
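Independently of NUnit, the core of the adjustment can be sketched in a few lines. The following Java sketch is only an illustration of the counting rule, not the NUnit code: the names CaretColumn, firstMismatch and displayColumn are invented here, and it assumes that every character except a non-spacing mark occupies exactly one cell in a monospaced font.

```java
final class CaretColumn {
    /** Index of the first differing UTF-16 code unit, or -1 if the strings are equal. */
    static int firstMismatch(String expected, String actual) {
        int shared = Math.min(expected.length(), actual.length());
        for (int i = 0; i < shared; i++) {
            if (expected.charAt(i) != actual.charAt(i)) {
                return i;
            }
        }
        return expected.length() == actual.length() ? -1 : shared;
    }

    /** Number of display cells taken by the first index chars, skipping non-spacing marks. */
    static int displayColumn(String s, int index) {
        int column = 0;
        for (int i = 0; i < index && i < s.length(); i++) {
            if (Character.getType(s.charAt(i)) != Character.NON_SPACING_MARK) {
                column++;
            }
        }
        return column;
    }

    public static void main(String[] args) {
        // The two Thai strings from the question, written as escapes
        String expected = "\u0E43\u0E0A\u0E49\u0E07\u0E32\u0E19\u0E07\u0E32\u0E22";
        String actual = "\u0E43\u0E0A\u0E49\u0E07\u0E32\u0E19\u0E07\u0E48\u0E32\u0E22";
        int mismatch = firstMismatch(expected, actual);
        System.out.println(mismatch);                          // char offset
        System.out.println(displayColumn(expected, mismatch)); // caret column
    }
}
```

For the strings in the question, the char offset of the mismatch is 7, but only 6 cells precede it on screen, because one tone mark sits above an earlier consonant.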
Yet another pitfall. If you're using NUnit with Visual Studio through ReSharper like I do, you have to configure your Internet Explorer's fonts (this cannot be managed with R#) so that it uses proper monospaced fonts for Thai, Devanagari, etc.
Implementation.
Inherit from TextMessageWriter and override its DisplayStringDifferences;
Implement your own ClipExpectedAndActual and FindMismatchPosition; this is where non-spacing modifiers are respected. Proper clipping is needed since it may also affect the calculation of non-spacing elements;
Inherit from EqualConstraint and override its WriteMessageTo(MessageWriter writer) so that your MessageWriter is used;
Optionally, create a custom wrapper for simple invocation of the custom constraint.
The source code goes below. About 80% of the code doesn't do anything useful, but it's included due to access levels in original code.
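using System;
using System.Globalization;
using NUnit.Framework;
using NUnit.Framework.Constraints;
// Step 1.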
public class ThaiMessageWriter : TextMessageWriter
{
/// <summary>
/// This method is merely a copy of the original method taken from NUnit sources,
/// except that it changes the meaning of <paramref name="mismatch"/> before the caret line is displayed.
/// </summary>
/// <remarks>
/// The originally passed <paramref name="mismatch"/> is an offset in chars, while proper display of the caret requires
/// its position to be calculated in character placeholder units. The two differ for
/// over- or under-string Unicode characters such as an acute mark or complex scripts (Thai).
/// </remarks>
/// <param name="clipping"></param>
public override void DisplayStringDifferences(string expected, string actual, int mismatch, bool ignoreCase, bool clipping)
{
// Maximum string we can display without truncating
int maxDisplayLength = MaxLineLength
- PrefixLength // Allow for prefix
- 2; // 2 quotation marks
int mismatchOffset = mismatch;
if (clipping)
MsgUtils2.ClipExpectedAndActual(ref expected, ref actual, maxDisplayLength, mismatchOffset);
expected = MsgUtils.EscapeControlChars(expected);
actual = MsgUtils.EscapeControlChars(actual);
// The mismatch position may have changed due to clipping or white space conversion
int mismatchInCharPlaceholders = MsgUtils2.FindMismatchPosition(expected, actual, 0, ignoreCase);
Write(Pfx_Expected);
WriteExpectedValue(expected);
if (ignoreCase)
WriteModifier("ignoring case");
WriteLine();
WriteActualLine(actual);
//DisplayDifferences(expected, actual);
if (mismatch >= 0)
WriteCaretLine(mismatchInCharPlaceholders);
}
// Copied due to private
/// <summary>
/// Write the generic 'Actual' line for a constraint
/// </summary>
/// <param name="constraint">The constraint for which the actual value is to be written</param>
private void WriteActualLine(Constraint constraint)
{
Write(Pfx_Actual);
constraint.WriteActualValueTo(this);
WriteLine();
}
// Copied due to private
/// <summary>
/// Write the generic 'Actual' line for a given value
/// </summary>
/// <param name="actual">The actual value causing a failure</param>
private void WriteActualLine(object actual)
{
Write(Pfx_Actual);
WriteActualValue(actual);
WriteLine();
}
// Copied due to private
private void WriteCaretLine(int mismatch)
{
// We subtract 2 for the initial 2 blanks and add back 1 for the initial quote
WriteLine(" {0}^", new string('-', PrefixLength + mismatch - 2 + 1));
}
}
// Step 2.
public static class MsgUtils2
{
private static readonly string ELLIPSIS = "...";
/// <summary>
/// Almost a copy of MsgUtil.ClipExpectedAndActual method
/// </summary>
/// <param name="expected"></param>
/// <param name="actual"></param>
/// <param name="maxDisplayLength"></param>
/// <param name="mismatch"></param>
public static void ClipExpectedAndActual(ref string expected, ref string actual, int maxDisplayLength, int mismatch)
{
// Case 1: Both strings fit on line
int maxStringLength = Math.Max(expected.Length, actual.Length);
if (maxStringLength <= maxDisplayLength)
return;
// Case 2: Assume that the tail of each string fits on line
int clipLength = maxDisplayLength - ELLIPSIS.Length;
int clipStart = maxStringLength - clipLength;
// Case 3: If it doesn't, center the mismatch position
if (clipStart > mismatch)
clipStart = Math.Max(0, mismatch - clipLength / 2);
// shift both clipStart and maxDisplayLength if they split non-placeholding symbol
AdjustForNonPlaceholdingCharacter(expected, ref clipStart);
AdjustForNonPlaceholdingCharacter(expected, ref maxDisplayLength);
expected = MsgUtils.ClipString(expected, maxDisplayLength, clipStart);
actual = MsgUtils.ClipString(actual, maxDisplayLength, clipStart);
}
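private static void AdjustForNonPlaceholdingCharacter(string expected, ref int index)
{
// Guard the upper bound: index may be past the end when expected is the shorter string
while (index > 0 && index < expected.Length && CharUnicodeInfo.GetUnicodeCategory(expected[index]) == UnicodeCategory.NonSpacingMark)
{
index--;
}
}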
static public int FindMismatchPosition(string expected, string actual, int istart, bool ignoreCase)
{
int length = Math.Min(expected.Length, actual.Length);
string s1 = ignoreCase ? expected.ToLower() : expected;
string s2 = ignoreCase ? actual.ToLower() : actual;
int iSpacingCharacters = 0;
for (int i = 0; i < istart; i++)
{
if (CharUnicodeInfo.GetUnicodeCategory(s1[i]) != UnicodeCategory.NonSpacingMark)
iSpacingCharacters++;
}
for (int i = istart; i < length; i++)
{
if (s1[i] != s2[i])
return iSpacingCharacters;
if (CharUnicodeInfo.GetUnicodeCategory(s1[i]) != UnicodeCategory.NonSpacingMark)
iSpacingCharacters++;
}
//
// Strings have same content up to the length of the shorter string.
// Mismatch occurs because string lengths are different, so show
// that they start differing where the shortest string ends
//
if (expected.Length != actual.Length)
return iSpacingCharacters; // placeholder column where the shorter string ends, not the char offset
//
// Same strings : We shouldn't get here
//
return -1;
}
}
// Step 3.
public class ThaiEqualConstraint : EqualConstraint
{
private readonly string _expected;
// WTF expected is private?
public ThaiEqualConstraint(string expected) : base(expected)
{
_expected = expected;
}
public override void WriteMessageTo(MessageWriter writer)
{
// redirect output to customized MessageWriter
var myMessageWriter = new ThaiMessageWriter();
base.WriteMessageTo(myMessageWriter);
writer.Write(myMessageWriter);
}
}
// Step 4.
public static class ThaiText
{
public static EqualConstraint IsEqual(string expected)
{
return new ThaiEqualConstraint(expected);
}
}

Related

Misleading exception message in GatewayMethodInboundMessageMapper with un-annotated parameters

The following code throws a MessagingException with the message: At most one parameter (or expression via method-level @Payload) may be mapped to the payload or Message. Found more than one on method [public abstract java.lang.Integer org.example.PayloadAndGatewayHeader$ArithmeticGateway.add(int,int)].
@MessagingGateway
interface ArithmeticGateway {
    @Gateway(requestChannel = "add.input", headers = @GatewayHeader(name = "operand", expression = "#args[1]"))
    Integer add(@Payload final int a, final int b);
}
The desired functionality could be achieved with something like:
@MessagingGateway
interface ArithmeticGateway {
    @Gateway(requestChannel = "add.input", headers = @GatewayHeader(name = "operand", expression = "#args[1]"))
    @Payload("#args[0]")
    Integer add(final int a, final int b);
}
Should the first version also work? Nevertheless I believe the error message could be improved.
A sample project can be found here. Please check org.example.PayloadAndGatewayHeader and org.example.PayloadAndGatewayHeaderTest.
EDIT
The purpose of @GatewayHeader was to show why one may want to have additional parameters that will not be part of the payload, but I am afraid it created confusion. Here is a more streamlined example:
@MessagingGateway
interface ArithmeticGateway {
    @Gateway(requestChannel = "identity.input")
    Integer identity(@Payload final int a, final int unused);
}
Shouldn't the unused parameter be ignored since there is already another one that is annotated with @Payload?
You can't mix parameter annotations (which are static) with expressions (which are dynamic) because the static code analysis can't anticipate what the dynamic expression will resolve to at runtime. It is probably unlikely, but there theoretically could be conditions in the expression. In any case, it can't determine at analysis time that the expression will provide a value for #args[1] at runtime (it could, of course for this simple case, but not all cases are this simple).
Use one or the other; use your second approach or
Integer add(@Payload final int a, @Header("operand") final int b);

How do switch statements work in Java 1.5 with string arguments?

Can I implement switch statements (by passing string arguments) in Java 5 without making use of enums? I tried doing it using hashCode, but I got an error.
package com.list;

import java.util.Scanner;

public class SwitchDays implements Days {
    static final int str = "sunday".hashCode();

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        String day = in.nextLine();
        switch (day.hashCode()) {
            case str:
                System.out.println(day);
                break;
            default:
                break;
        }
    }
}
str in case str gives an error:
case expressions must be constant expressions
Please guide.
The problem is that str is referring to the expression "sunday".hashCode(), which is not a compile time constant expression as described by the JLS:
A constant expression is an expression denoting a value of primitive
type or a String that does not complete abruptly and is composed using
only the following: ... Qualified names (§6.5.6.2) of the form
TypeName . Identifier that refer to constant variables (§4.12.4).
When you check the definition of constant variables:
A constant variable is a final variable of primitive type or type
String that is initialized with a constant expression (§15.28).
Whether a variable is a constant variable or not may have implications
with respect to class initialization (§12.4.1), binary compatibility
(§13.1, §13.4.9), and definite assignment (§16 (Definite Assignment)).
Since "sunday".hashCode() does not meet these requirements, you get the error.
If you changed "sunday".hashCode() to a real compile-time constant like 3, it would compile.
The most straightforward solution is to make an enum, i.e.
enum Days {
    sunday, monday;
}
Then it could be used as:
Days d = Days.valueOf("sunday");
switch (d) {
    case sunday:
        System.out.println("ONE");
        break;
    case monday:
        System.out.println("TWO");
        break;
}
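One detail the snippet above leaves out is that valueOf throws IllegalArgumentException when the input does not match any constant. A self-contained sketch that handles this case might look as follows; the names DaySwitch and describe, and the "UNKNOWN" fallback, are chosen only for this illustration.

```java
final class DaySwitch {
    enum Day { sunday, monday }

    /** Maps user input to a label by switching on the matching enum constant. */
    static String describe(String input) {
        Day d;
        try {
            d = Day.valueOf(input); // exact, case-sensitive match on the constant name
        } catch (IllegalArgumentException e) {
            return "UNKNOWN";       // input is not a known day
        }
        switch (d) {
            case sunday:
                return "ONE";
            case monday:
                return "TWO";
            default:
                return "UNKNOWN";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("sunday"));  // ONE
        System.out.println(describe("monday"));  // TWO
        System.out.println(describe("holiday")); // UNKNOWN
    }
}
```

Enum constants are valid case labels because the compiler resolves them by name, which is why this works in Java 5 while a hashCode() result does not.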

C++ : Strings, Structures and Access Violation Writing Locations

I'm trying to take a string input from a method and assign it to a member of a structure, which I then place in a linked list. I didn't include all of the code, but I did post the constructor and the relevant parts. The code breaks at the lines
node->title = newTitle;
node->isbn = newISBN;
So newTitle is the string input from the method that I'm trying to assign to the title member of the Book structure that node points to. I'm assuming this has to do with an issue with pointers and assigning data through them, but I can't figure out a fix or alternative.
Also, I tried using
strcpy(node->title, newTitle)
But that failed because strcpy only works on character arrays and cannot take a string. I also tried a few other things, but none panned out. Help with an explanation would be appreciated.
struct Book
{
string title;
string isbn;
struct Book * next;
};
//class LinkedList will contains a linked list of books
class LinkedList
{
private:
Book * head;
public:
LinkedList();
~LinkedList();
bool addElement(string title, string isbn);
bool removeElement(string isbn);
void printList();
};
//Constructor
//It sets head to be NULL to create an empty linked list
LinkedList::LinkedList()
{
head = NULL;
}
//Description: Adds an element to the list in alphabetical order, unless a book with
// the same title exists, in which case it is discarded
// Returns true if added, false otherwise
bool LinkedList::addElement(string newTitle, string newISBN)
{
struct Book *temp;
struct Book *lastEntry = NULL;
temp = head;
if (temp==NULL) //If the list is empty, sets data to first entry
{
struct Book *node;
node = (Book*) malloc(sizeof(Book));
node->title = newTitle;
node->isbn = newISBN;
head = node;
}
while (temp!=NULL)
{
... //Rest of Code
Note that your Book struct is already a linked list implementation, so you don't need the LinkedList class at all, or alternatively you don't need the 'next' element of the struct.
But there's no reason from the last (long) code snippet you pasted to have an error at the lines you indicated. node->title = newTitle should copy the string in newTitle to the title field of the struct. The string object is fixed size so it's not possible to overwrite any buffer and cause a seg fault.
However, there may be memory corruption from something you do further up the code, which doesn't cause an error until later on. The thing to look for is any arrays, including char[], that you might be overfilling. Another idea is you mention you save method parameters. If you copy, it's ok, but if you do something like
char* f() {
char str[20];
strcpy(str, "hello");
return str;
}
...then you've got a problem. (Because str is allocated on the stack and you return only the pointer to a location that won't be valid after the function returns.) Method parameters are local variables.
The answer you seek can be found here.
In short: the memory malloc returns does not contain a properly constructed object, so you can't use it as such. Try using new / delete instead.

Constants in Haxe

How do you create public constants in Haxe? I just need the analog of good old const in AS3:
public class Hello
{
public static const HEY:String = "hey";
}
The usual way to declare a constant in Haxe is using the static and inline modifiers.
class Main {
    public static inline var Constant = 1;

    static function main() {
        trace(Constant);
        trace(Main.Constant);
    }
}
If you have a group of related constants, it can often make sense to use an enum abstract. Values of enum abstracts are static and inline implicitly.
Note that only the basic types (Int, Float, Bool) as well as String are allowed to be inline, for others it will fail with this error:
Inline variable initialization must be a constant value
Luckily, Haxe 4 has introduced a final keyword which can be useful for such cases:
public static final Regex = ~/regex/;
However, final only prevents reassignment, it doesn't make the type immutable. So it would still be possible to add or remove values from something like static final Values = [1, 2, 3];.
For the specific case of arrays, Haxe 4 introduces haxe.ds.ReadOnlyArray which allows for "constant" lists (assuming you don't work around it using casts or reflection):
public static final Values:haxe.ds.ReadOnlyArray<Int> = [1, 2, 3];
Values = []; // Cannot access field or identifier Values for writing
Values.push(0); // haxe.ds.ReadOnlyArray<Int> has no field push
Even though this is an array-specific solution, the same approach can be applied to other types as well. ReadOnlyArray<T> is simply an abstract type that creates a read-only "view" by doing the following:
it wraps Array<T>
it uses @:forward to only expose fields that don't mutate the array, such as length and map()
it allows implicit casts from Array<T>
You can see how it's implemented here.
For non-static variables and objects, you can give them shallow constness as shown below:
public var MAX_COUNT(default, never):Int = 100;
This means you can read the value in the 'default' way but can 'never' write to it.
More info can be found at http://adireddy.github.io/haxe/keywords/never-inline-keywords.

Structure Reading Theory Problem

I have a DBC file, which is a database file for a game, containing in-game spell data such as ID, SpellName, Category, etc.
The struct is something like this:
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Ansi, Pack = 1)]
public struct SpellEntry
{
public uint ID;
public uint Category;
public float speed;
[MarshalAs(UnmanagedType.ByValArray, SizeConst = 8, ArraySubType = UnmanagedType.I4)]
public int[] Reagent;
public int EquippedItemClass;
[MarshalAs(UnmanagedType.LPStr)] // Crash here
public string SpellName;
}
I am reading the file with a binary reader and marshaling it to the struct. Snippet:
binReader.BaseStream.Seek(DBCFile.HEADER_SIZE + (index * 4 * 234), SeekOrigin.Begin);
buff = binReader.ReadBytes(buff.Length);
GCHandle handdle = GCHandle.Alloc(buff, GCHandleType.Pinned);
Spell.SpellEntry testspell = (Spell.SpellEntry)Marshal.PtrToStructure(handdle.AddrOfPinnedObject(), typeof(Spell.SpellEntry));
handdle.Free();
To make things more complex, let's look at how the DBC file stores strings such as SpellName. They are not in the records; strings are kept at the end of the file, in a "string table" block. The string field in a record contains a number (an offset) pointing to the string in the string table, so it's not really a string.
I managed to read all the strings from the string block (at the end of the file) into a string[]. (This is done before I start reading the records.)
Then I would start reading the records, but I have two problems:
1.) I can't read a record, because it "crashes" on the last field of my struct (because it's not really a string).
2.) I can't assign a string to the number.
When I read the field, it will be a number, but in the end I have to assign to SpellName the string that the number points to in the string table. Jeez.
public struct SpellEntry
{
//...
private int SpellNameOffset;
public string SpellName {
get { return Mumble.GetString(SpellNameOffset); }
}
}
This is hard to get right. Mumble must be a static class, since you cannot add any more fields to SpellEntry: doing so would throw off Marshal.SizeOf(), making it too large. You'll need to initialize Mumble so that its static GetString() method can access the string table. Moving the SpellName property into another class solves the problem but makes the code ugly too.
This is liable to confuse you badly. If you already have a working version that uses BitConverter, then you're definitely better off using it instead. Separating the file format from the runtime format is in fact an asset here.
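To illustrate what separating the file format from the runtime format can look like, here is a minimal sketch in Java rather than C#. It assumes, beyond what the question states, that record fields are 4-byte little-endian integers, that strings in the block are NUL-terminated, and that UTF-8 is an acceptable decoding; the class and method names (DbcStringTable, getString, readInt) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class DbcStringTable {
    private final byte[] block; // raw bytes of the string block at the end of the file

    DbcStringTable(byte[] block) {
        this.block = block;
    }

    /** Returns the NUL-terminated string that starts at the given offset in the block. */
    String getString(int offset) {
        int end = offset;
        while (end < block.length && block[end] != 0) {
            end++;
        }
        return new String(block, offset, end - offset, StandardCharsets.UTF_8);
    }

    /** Reads the 32-bit field with the given index from a raw record (assumed little-endian). */
    static int readInt(byte[] record, int fieldIndex) {
        return ByteBuffer.wrap(record).order(ByteOrder.LITTLE_ENDIAN).getInt(fieldIndex * 4);
    }

    public static void main(String[] args) {
        DbcStringTable table = new DbcStringTable(
                "\0Fireball\0Frostbolt\0".getBytes(StandardCharsets.UTF_8));
        byte[] record = {10, 0, 0, 0}; // a single field holding the offset 10
        System.out.println(table.getString(readInt(record, 0)));
    }
}
```

The record keeps only the integer offset, and the string is resolved on demand, so the layout read from disk never has to contain a string field.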
