Why does this Excel string comparison fail?

Is this an Excel bug? Has anyone experienced this issue? Please help.

Just a thought, but here's what Microsoft says about TRIM:
The TRIM function was designed to trim the 7-bit ASCII space character
(value 32) from text. In the Unicode character set, there is an
additional space character called the nonbreaking space character that
has a decimal value of 160. This character is commonly used in Web
pages as the HTML entity, &nbsp;. By itself, the TRIM function does
not remove this nonbreaking space character.
You might try this to replace the non-breaking space (if that is your problem here):
=TRIM(SUBSTITUTE(A5,CHAR(160),CHAR(32)))
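For comparison, the same cleanup can be sketched outside Excel. This is a minimal Python sketch and not part of the original answer; the function name clean_cell is made up for illustration. It swaps the non-breaking space (code 160) for a regular space, then mimics TRIM by collapsing runs of spaces and stripping both ends.

```python
import re

NBSP = "\u00a0"  # non-breaking space, decimal 160


def clean_cell(text):
    """Roughly mirror =TRIM(SUBSTITUTE(A5,CHAR(160),CHAR(32)))."""
    text = text.replace(NBSP, " ")   # SUBSTITUTE(..., CHAR(160), CHAR(32))
    text = re.sub(" +", " ", text)   # TRIM collapses runs of spaces
    return text.strip(" ")           # and removes leading/trailing spaces


imported = "Apple" + NBSP                # looks identical to "Apple" on screen
print(imported == "Apple")               # False: the trailing NBSP still counts
print(clean_cell(imported) == "Apple")   # True
```

If the comparison still fails after this kind of cleanup, some other non-printing character is probably involved.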

I would have to agree with @Jeeped. Your formula looks correct in all respects, so the cause must be a non-printing character. If this data comes from an outside source (e.g. another file), it could very well contain one. I just typed in everything you had manually and came up with this.

Related

How was this invisible space created?

'{FileTitle​}' === '{FileTitle}'
// false
There's a space between the first string's last e and }
'{FileTitle​}'.length
// 12
'{FileTitle}'.length
// 11
There is a Unicode character with code point 8203 (U+200B) between those two characters. This is a zero-width space. Have a look at the corresponding Wikipedia article for more info.
This is a great example of a sometimes nasty problem :-)
If I copy your code to TextWrangler, then I see the space. If I choose "Hex Dump", I see the hex bytes 0B 20. Given the little-endian byte order (thanks to @axiac), this is the character 0x200B, decimal 8203.
For information about specific Unicode characters, use this: http://unicode-table.com/de/search/?q=8203 You'll see the description "Zero Width Space".
As for how this character got into your code, one can only guess. Option one: you typed it in your editor unwittingly by hitting a certain key combination. Option two: you copied it from a rich text document as a stowaway. Option three: it got there through some botched multibyte string operation.
A related problem is the character 0xA0 (U+00A0), the non-breaking space. Strictly speaking it is not ASCII, since it lies outside the 7-bit range. It cannot be distinguished from a normal space by eye, but it causes compiler syntax errors that are sometimes hard to track down.
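One way to hunt for such stowaways is to list every character outside printable ASCII together with its Unicode name. The following is a minimal Python sketch; find_invisibles is a hypothetical helper name, and treating everything outside the range 32 to 126 as suspicious is a simplifying assumption that will also flag legitimate non-ASCII text.

```python
import unicodedata


def find_invisibles(text):
    """Return (index, code point, name) for each character outside printable ASCII."""
    hits = []
    for i, ch in enumerate(text):
        if not (32 <= ord(ch) <= 126):
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return hits


suspect = "{FileTitle\u200b}"   # the zero-width space is written as an escape so it is visible
print(len(suspect))                            # 12
print(find_invisibles(suspect))                # [(10, 'U+200B', 'ZERO WIDTH SPACE')]
print("\u200b".encode("utf-16-le").hex())      # 0b20, the bytes seen in the hex dump
```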
This one has no space
'{FileTitle}' === '{FileTitle}'
You just used a different encoding.

Replace character with a safe character and vice-versa

Here's my problem:
I need to store sentences "somewhere" (it doesn't matter where).
The sentences must not contain spaces.
When I extract the sentences from that "somewhere", I need to restore the spaces.
So, before storing the sentence "I am happy" I could replace the spaces with a safe character, such as &. In C#:
theString.Replace(' ', '&');
This would yield 'I&am&happy'.
And when retrieving the sentence, I would do the reverse:
theString.Replace('&', ' ');
But what if the original sentence already contains the '&' character?
Say I did the same thing with the sentence 'I am happy & healthy'. With the design above, the string would come back as 'I am happy   healthy', since the original '&' character would also be replaced with a space.
(Of course, I could change the & character to a more unlikely symbol, such as ¤, but I want this to be bulletproof.)
I used to know how to solve this, but I forgot how.
Any ideas?
Thanks!
Fredrik
Maybe you can use url encoding (percent encoding) as an inspiration.
Characters that are not valid in a url are escaped by writing %XX where XX is a numeric code that represents the character. The % sign itself can also be escaped in the same way, so that way you never run into problems when translating it back to the original string.
There are probably other similar encodings, and for your own application you can use an & just as well as a %, but by using an existing encoding like this, you can probably also find existing functions to do the encoding and decoding for you.
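As an illustration of the percent-encoding idea, here is a minimal sketch in Python rather than C#, using the standard library's urllib.parse. The function names store and restore are invented for this example. Passing safe="" is a choice made here so that every reserved character is escaped, not only the space.

```python
from urllib.parse import quote, unquote


def store(sentence):
    # safe="" escapes every reserved character, including "&", "/" and "%"
    return quote(sentence, safe="")


def restore(stored):
    return unquote(stored)


original = "I am happy & healthy"
stored = store(original)
print(stored)                        # I%20am%20happy%20%26%20healthy
print(" " in stored)                 # False
print(restore(stored) == original)   # True
```

Because the escape character % is itself escaped (as %25), a sentence that already contains % or & survives the round trip unchanged.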

New lines in tab delimited or comma delimited output

I am looking for some best practices as far as handling csv and tab delimited files.
For CSV files I am already doing some formatting if a value contains a comma or double quote but what if the value contains a new line character? Should I leave the new line intact and encase the value in double quotes + escape any double quotes within the value?
Same question for tab delimited files. I assume the answer would be very similar if not the same.
Usually you keep \n unaltered and rely on the fact that the newline character is enclosed in a double-quoted string. This doesn't create ambiguities, but it is really ugly if you have to look at the file in a normal text editor.
But that is how you should do it, since you don't escape anything inside a string in a CSV file except the double quote itself.
@Jack is right that your best bet is to keep the \n unaltered, since a parser will expect it inside double quotes when it occurs.
As with most things, I think consistency here is key. As far as I know, your values only need to be double-quoted if they span multiple lines, contain commas, or contain double-quotes. In some implementations I've seen, all values are escaped and double-quoted, since it makes the parsing algorithm simpler (there's never a question of escaping and double-quoting, and the reverse on reading the CSV).
This isn't the most space-optimized solution, but makes reading and writing the file a trivial affair, for both your own library and others that may consume it in the future.
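To make the quoting rule concrete, here is a small sketch using Python's standard csv module, chosen only for illustration since the question names no language. With the default dialect the writer quotes a field only when it contains a comma, a double quote, or a line break, and it doubles any embedded double quotes.

```python
import csv
import io

row = ["1", 'first line\nsecond "quoted" line', "a,b"]

buf = io.StringIO(newline="")    # newline="" leaves line endings untouched
csv.writer(buf).writerow(row)    # default dialect: quote only when needed
text = buf.getvalue()
print(repr(text))                # the newline stays inside the quoted field

decoded = list(csv.reader(io.StringIO(text, newline="")))
print(decoded == [row])          # True: the multi-line value round-trips
```

Passing quoting=csv.QUOTE_ALL to the writer gives the "quote everything" variant mentioned above.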
For TSV, if you want lossless representation of values, the "Linear TSV" specification is worth considering: http://paulfitz.github.io/dataprotocols/linear-tsv/index.html
Most such conventions use at least the following escape sequences:
\n for newline,
\t for tab,
\r for carriage return,
\\ for backslash
Some tools add \0 for NUL.
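A minimal sketch of this style of escaping, in Python, is shown below. It handles only the four escapes listed above and is not a complete implementation of the Linear TSV specification; the helper names are made up for illustration. Decoding is done in a single pass so that an escaped backslash followed by the letter n is not mistaken for a newline.

```python
import re

ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r"}


def escape_field(value):
    return "".join(ESCAPES.get(ch, ch) for ch in value)


def unescape_field(value):
    # One left-to-right pass; unknown escapes are left as they are.
    return re.sub(r"\\(.)", lambda m: UNESCAPES.get(m.group(1), m.group(0)), value)


def encode_row(fields):
    return "\t".join(escape_field(f) for f in fields)


def decode_row(line):
    return [unescape_field(f) for f in line.split("\t")]


row = ["a\tb", "line1\nline2", "back\\slash", "\\n literal"]
line = encode_row(row)
print(line)
print(decode_row(line) == row)   # True
```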

When is it acceptable to not trim a user input string?

Can someone give me a real-world scenario of a method/function with a string argument which came from user input (e.g. form field, parsed data from file, etc.) where leading or trailing spaces SHOULD NOT have been trimmed?
I can't ever recall such a situation for myself.
EDIT: Mind you, I didn't say trimming any whitespace. I said trimming leading or trailing (only) spaces (or whitespace).
Search string in any "Find" dialog in an editor.
Password input boxes. There's lots of data out there where whitespace can genuinely be considered an important part of the string. Restricting the question to leading and trailing whitespace narrows things down a lot, but there are still many examples. Stuff you pass through a PHP-style nl2br function.
If you are inputting code. There may be a scenario where whitespace at the beginning and end is necessary.
Also, look at Stack Overflow's markdown editor. Code examples are indented. If you posted just a code example, then it will require leading and trailing white space not be trimmed.
Perhaps a Whitespace interpreter.
Python....
A Stack Overflow answer, or more generally input written in markdown (four leading spaces -> code block).
A paragraph entry.
If the input is python code (say, for a pastebin kinda thing), you certainly can't trim leading white space; but you also can't trim trailing white space, because it could be a part of a multi-line string (triple quoted string).
I've used whitespace as a delimiter before, so there. Also, for anything that involves concatenating multiple inputs, removing leading/trailing whitespace can break formatting or possibly do worse. Aside from that, as Spencer said, for indented paragraphs you probably would not want to remove the leading whitespace.
Obviously passwords should not be trimmed. Passwords can contain leading or trailing whitespace that needs to be treated as valid characters.

Most reliable split character

If you were forced to use a single char on a split method, which char would be the most reliable?
Definition of reliable: a split character that is not part of the individual sub strings being split.
We currently use
public const char Separator = ((char)007);
I think this is the bell (beep) character, if I am not mistaken.
Aside from 0x0, which may not be available (because of null-terminated strings, for example), the ASCII control characters between 0x1 and 0x1f are good candidates. The ASCII characters 0x1c-0x1f are even designed for such a thing and have the names File Separator, Group Separator, Record Separator, Unit Separator. However, they are forbidden in transport formats such as XML.
In that case, characters from the Unicode private use code points may be used instead.
One last option would be to use an escaping strategy, so that the separation character can be entered somehow anyway. However, this complicates the task quite a lot and you cannot use String.Split anymore.
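As a rough illustration of the ASCII separator characters, the Python sketch below joins fields with Unit Separator (0x1f) and records with Record Separator (0x1e). The function names pack and unpack are hypothetical. Rejecting input that already contains a separator is an assumption made here to keep the example honest; without escaping, that is the only way to guarantee a clean split.

```python
UNIT_SEP = "\x1f"    # ASCII Unit Separator: between fields
RECORD_SEP = "\x1e"  # ASCII Record Separator: between records


def pack(records):
    for record in records:
        for field in record:
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError("field contains a separator character")
    return RECORD_SEP.join(UNIT_SEP.join(record) for record in records)


def unpack(blob):
    return [record.split(UNIT_SEP) for record in blob.split(RECORD_SEP)]


records = [["a, b", "c|d"], ["line\nbreak", ""]]
print(unpack(pack(records)) == records)   # True
```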
You can safely use whatever character you like as delimiter, if you escape the string so that you know that it doesn't contain that character.
Let's for example choose the character 'a' as delimiter. (I intentionally picked a usual character to show that any character can be used.)
Use the character 'b' as escape code. We replace any occurrence of 'a' with 'b1' and any occurrence of 'b' with 'b2':
private static string Escape(string s) {
    // Escape the escape character first, then the delimiter.
    return s.Replace("b", "b2").Replace("a", "b1");
}
Now, the string doesn't contain any 'a' characters, so you can put several of those strings together:
string msg = Escape("banana") + "a" + Escape("aardvark") + "a" + Escape("bark");
The string now looks like this:
b2b1nb1nb1ab1b1rdvb1rkab2b1rk
Now you can split the string on 'a' and get the individual parts:
b2b1nb1nb1
b1b1rdvb1rk
b2b1rk
To decode the parts you do the replacement backwards:
private static string Unescape(string s) {
    // Reverse order: restore the delimiter first, then the escape character.
    return s.Replace("b1", "a").Replace("b2", "b");
}
So splitting the string and unencoding the parts is done like this:
string[] parts = msg.Split('a');
for (int i = 0; i < parts.Length; i++) {
    parts[i] = Unescape(parts[i]);
}
Or using LINQ:
string[] parts = msg.Split('a').Select<string,string>(Unescape).ToArray();
If you choose a less common character as delimiter, there are of course fewer occurrences that will be escaped. The point is that the method makes sure that the character is safe to use as delimiter without making any assumptions about what characters exists in the data that you want to put in the string.
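For readers who want to try the scheme outside C#, here is the same idea as a short Python sketch. It keeps the answer's deliberately awkward choice of 'a' as delimiter and 'b' as escape character; the helper names join_parts and split_parts are made up for illustration.

```python
def escape(s):
    # Escape the escape character first, then the delimiter.
    return s.replace("b", "b2").replace("a", "b1")


def unescape(s):
    # Reverse order: restore the delimiter first, then the escape character.
    return s.replace("b1", "a").replace("b2", "b")


def join_parts(parts):
    return "a".join(escape(p) for p in parts)


def split_parts(msg):
    return [unescape(p) for p in msg.split("a")]


msg = join_parts(["banana", "aardvark", "bark"])
print(msg)                # b2b1nb1nb1ab1b1rdvb1rkab2b1rk
print(split_parts(msg))   # ['banana', 'aardvark', 'bark']
```

One caveat of this sketch: an empty list and a list holding a single empty string both encode to the empty string, so that case needs separate handling if it matters.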
I usually prefer the '|' symbol as the split character. If you are not sure what the user will enter, you can prevent the user from entering certain special characters and pick the split character from among them.
It depends what you're splitting.
In most cases it's best to use split chars that are fairly commonly used, for instance
value, value, value
value|value|value
key=value;key=value;
key:value;key:value;
You can use quoted identifiers nicely with commas:
"value", "value", "value with , inside", "value"
I tend to use , first, then |, then if I can't use either of them I use the section-break char §
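Note that on Windows you can type a character by its code with ALT+number (on the numeric keypad only), so § is ALT+21. Strictly speaking § is not an ASCII character; the code comes from the OEM code page.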
\0 is a good split character. It's pretty hard (impossible?) to enter from keyboard and it makes logical sense.
\n is another good candidate in some contexts.
And of course, .NET strings are Unicode, so there is no need to limit yourself to the first 256 characters. You can always use a rare Mongolian letter or some reserved or unused Unicode symbol.
There are overloads of String.Split that take string separators...
I'd personally say that it depends entirely on the situation; if you're writing a simple TCP/IP chat system, you obviously shouldn't use '\n' as the split character. But '\0' is a good character to use because users can't ever enter it!
First of all, in C# (or .NET), you can use more than one split character in one split operation.
The String.Split Method (Char[]) reference describes the separator parameter as:
An array of Unicode characters that delimit the substrings in this instance, an empty array that contains no delimiters, or null reference (Nothing in Visual Basic).
In my opinion, there's no MOST reliable split character, however some are more suitable than others.
Popular split characters like tab, comma, and pipe are good for viewing the unsplit string or line.
If it's only for storing/processing, the safer characters are probably those that are seldom used or those not easily entered from the keyboard.
It also depends on the usage context. E.g. if you are expecting the data to contain email addresses, "@" is a no-no.
Say we were to pick one from the ASCII set. There are quite a number to choose from, e.g. " ` ", " ^ " and some of the non-printable characters. Do beware, though, that not all characters are suitable. E.g. 0x00 might have an adverse effect on some systems.
It depends very much on the context in which it's used. If you're talking about a very general delimiting character then I don't think there is a one-size-fits-all answer.
I find that the ASCII null character '\0' is often a good candidate, or you can go with nitzmahone's idea and use more than one character, then it can be as crazy as you want.
Alternatively, you can parse the input and escape any instances of your delimiting character.
The "|" pipe sign is mostly used when you are passing several arguments to a method that accepts just a single string parameter.
This is widely used in SQL Server stored procedures as well, where you need to pass an array as a parameter. But mostly it depends on the situation in which you need it.
