I need to create a string to store pairs of key/value data, for example:
key1::value1||key2::value2||key3::value3
When deserializing it, I may encounter an error if a key or a value happens to contain || or ::
What are common techniques for dealing with this situation? Thanks.
A common way to deal with this is called an escape character or qualifier. Consider this Comma-Separated line:
Name,City,State
John Doe, Jr.,Anytown,CA
Because the name field contains a comma, a naive parser splits it into two fields, and every field after it shifts out of its column.
If you enclose each data value by qualifiers, the parser knows when to ignore the delimiter, as in this example:
Name,City,State
"John Doe, Jr.",Anytown,CA
Qualifiers can be optional, used only on data fields that need it. Many implementations will use qualifiers on every field, needed or not.
You may want to implement something similar for your data encoding.
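As a rough illustration of how a qualifier-aware parser treats the two lines above, here is a minimal Python sketch using the standard csv module. Python and the variable names are assumptions made for the example; the answer itself is language-neutral.

```python
import csv
import io

# The same record, without and with qualifiers around the name field.
unquoted = 'John Doe, Jr.,Anytown,CA'
quoted = '"John Doe, Jr.",Anytown,CA'

# A naive split treats every comma as a delimiter.
naive_fields = unquoted.split(',')

# csv.reader honours the double-quote qualifier by default.
parsed_fields = next(csv.reader(io.StringIO(quoted)))

print(naive_fields)   # four fields: the name is split in two
print(parsed_fields)  # three fields: the name stays intact
```

The qualifier only helps if the reader and the writer agree on it, so it is usually easier to let one library do both sides.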
Escape || when serializing, and unescape it when deserializing. A common C-like way to escape is to prepend \. For example:
{ "a:b:c": "foo||bar", "asdf": "\\|||x||||:" }
serialize => "a\:b\:c:foo\|\|bar||asdf:\\\\\|\|\|x\|\|\|\|\:"
Note that \ needs to be escaped (and double escaped due to being placed in a C-style string).
If we assume that you have total control over the input string, then the common way of dealing with this problem is to use an escape character.
Typically, the backslash character (\) is used as an escape that says "treat the next character literally", so an escaped character is never read as a delimiter. The parser would see || and :: as delimiters, but would see \|\| as two pipe characters || within the key or the value.
The next problem is that the backslash is now overloaded: how do you represent a literal backslash? This is solved by escaping the backslash as well, so a \ in the data is written as \\, and the parser reads \\ as \.
Note that if you use escape characters, you can use a single character for the delimiters, which might make things simpler.
Alternatively, you may have to restrict the input: declare that || and :: are banned, and either fail or strip them when the string is encoded.
A simple solution is to escape a separator (with a backslash, for instance) any time it occurs in data:
Name,City,State
John Doe\, Jr.,Anytown,CA
Of course, the escape character itself will also need to be escaped when it occurs in data; in this case, a backslash would become \\.
You can use a non-printable character as the separator (e.g. vertical tab :-) ).
You can escape the separator characters in your data during serialization. For example, if you use a single character for each separator (key1:value1|key2:value2|...) and your data is:
this:is:key1 this|is|data1
this:is:key2 this|is|data2
you double every colon and pipe character in your data when you serialize it. So you will get:
this::is::key1:this||is||data1|this::is::key2:this||is||data2|...
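During deserialization, whenever you come across two colons or two pipe characters you know that this is not your separator but part of your data, and that you have to change it back to one character. On the other hand, every single colon or pipe character is your separator. Note that this scheme becomes ambiguous when data starts or ends with a separator character: key "a:" with value "b" and key "a" with value ":b" both serialize to a:::b.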
Use a prefix (say "a") for your special characters (say "b") present in the key and values to store them. This is called escaping.
Then decode the key and values by simply replacing any "ab" sequence with "b". Bear in mind that the prefix is also a special character. An example:
Prefix: \
Special characters: :, |, \
Encoded:
title:Slashdot\: News for Nerds. Stuff that Matters.|shortTitle:\\.
Decoded:
title=Slashdot: News for Nerds. Stuff that Matters.
shortTitle=\.
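A minimal Python sketch of the scheme above, assuming \ as the prefix and :, | and \ as the special characters. The function names are made up for illustration, and the decoder scans the string once instead of using split, so an escaped character is never mistaken for a delimiter.

```python
ESCAPE = "\\"    # the prefix character
PAIR_SEP = "|"   # separates one key/value pair from the next
KV_SEP = ":"     # separates a key from its value
SPECIAL = (ESCAPE, PAIR_SEP, KV_SEP)


def escape(text):
    """Prefix every special character with the escape character."""
    return "".join(ESCAPE + ch if ch in SPECIAL else ch for ch in text)


def serialize(pairs):
    return PAIR_SEP.join(escape(k) + KV_SEP + escape(v) for k, v in pairs)


def deserialize(data):
    """Scan left to right; a character after the prefix is always data."""
    if not data:
        return []
    pairs, fields, current = [], [], []
    chars = iter(data)
    for ch in chars:
        if ch == ESCAPE:
            current.append(next(chars, ""))  # take the next char literally
        elif ch == KV_SEP:
            fields.append("".join(current))
            current = []
        elif ch == PAIR_SEP:
            fields.append("".join(current))
            pairs.append(tuple(fields))
            fields, current = [], []
        else:
            current.append(ch)
    fields.append("".join(current))
    pairs.append(tuple(fields))
    return pairs


pairs = [("title", "Slashdot: News for Nerds. Stuff that Matters."),
         ("shortTitle", "\\.")]
encoded = serialize(pairs)
print(encoded)
print(deserialize(encoded))
```

The prefix has to be in the set of special characters; leaving it out is the usual way this kind of code ends up not round-tripping.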
The common technique is escaping reserved characters, for example:
In URLs you escape some characters using a %HEX representation:
http://example.com?aa=a%20b
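The same percent-encoding can protect delimiters in key/value data. The sketch below uses Python's urllib.parse; the sample key and value are invented, and the :: delimiter is borrowed from the original question as an assumption.

```python
from urllib.parse import quote, unquote

key, value = "a::b", "x||y z"

# safe="" asks quote() to percent-encode every reserved character,
# including ':' and '|', so neither delimiter can appear in the output.
encoded = quote(key, safe="") + "::" + quote(value, safe="")
print(encoded)

raw_key, raw_value = encoded.split("::")
decoded = (unquote(raw_key), unquote(raw_value))
print(decoded)
```

Because % itself is encoded as %25, the escape character is covered as well.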
In programming languages you escape some characters with a backslash prefix:
"\"hello\""
I saw this in a Python 3 tutorial about how to download a file and this is what it kinda looks like.
from urllib import request
import requests
goog="http://realchart.finance.yahoo.com/table.csvs=GOOG&d=8&e=7&f=2016&g=d&a=7&b=19&c=2004&ignore=.csv"
rp=request.urlopen(goog)
s=rp.read()
cp=str(s)
m=cp.split('\\n')
dest='goog.csv'
fw=open(dest,'w')
for c in m:
    fw.write(c + '\n')
fw.close()
fr=open('goog.csv','r')
k=fr.read()
print(k)
Why was this used?
split('\\n')
It's true that the code only works properly when you use the double backslash, but why?
The backslash is a special character inside string literals: it introduces characters that can't otherwise be typed naturally on a keyboard, if at all. The most common is the newline, '\n'.
However, since the backslash is special, how does one make a string contain an actual backslash? Simple: use the backslash to escape itself! A double backslash is translated into a single literal backslash.
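In the context of this question, the text being searched contains a literal backslash (calling str() on the bytes returned by read() produces the b'...' representation, in which each newline appears as the two characters \ and n), so to find this literal backslash one must use the double backslash.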
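To see where the literal backslash comes from, here is a small Python sketch. The sample bytes are invented and stand in for whatever rp.read() would return.

```python
raw = b"Date,Open\n2016-09-07,780.00\n"  # hypothetical download: bytes

as_repr = str(raw)
# as_repr is the text  b'Date,Open\n2016-09-07,780.00\n'  in which each \n
# is two characters, a backslash followed by the letter n, not a newline.

split_on_backslash_n = as_repr.split("\\n")  # splits on backslash + n
split_on_newline = as_repr.split("\n")       # no real newline to split on

# Decoding the bytes gives real newlines and drops the b'...' wrapper.
text = raw.decode("utf-8")
lines = text.splitlines()

print(split_on_backslash_n)
print(split_on_newline)
print(lines)
```

Decoding the bytes first is probably what the tutorial should have done; split('\\n') only works around the str() call.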
<button onclick='window.alert("\n")'>alert not escaped</button>
<button onclick='window.alert("\\n")'>alert escaped</button>
In a string a single backslash is a so-called 'escape' character. This is used to include special characters like tab (\t) or a new line (\n).
While running an R-plugin in SPSS, I receive a Windows path string as input e.g.
'C:\Users\mhermans\somefile.csv'
I would like to use that path in subsequent R code, but the backslashes need to be replaced with forward slashes; otherwise R interprets them as escapes (e.g. "\U used without hex digits" errors).
I have, however, not been able to find a function that can replace the backslashes with forward slashes or double-escape them. All those functions assume the characters are already escaped.
So, is there something along the lines of:
>gsub('\\', '/', 'C:\Users\mhermans')
C:/Users/mhermans
You can try to use the 'allowEscapes' argument in scan()
X=scan(what="character",allowEscapes=F)
C:\Users\mhermans\somefile.csv
print(X)
[1] "C:\\Users\\mhermans\\somefile.csv"
As of version 4.0, introduced in April 2020, R provides a syntax for specifying raw strings. The string in the example can be written as:
path <- r"(C:\Users\mhermans\somefile.csv)"
From ?Quotes:
Raw character constants are also available using a syntax similar to the one used in C++: r"(...)" with ... any character sequence, except that it must not contain the closing sequence )". The delimiter pairs [] and {} can also be used, and R can be used in place of r. For additional flexibility, a number of dashes can be placed between the opening quote and the opening delimiter, as long as the same number of dashes appear between the closing delimiter and the closing quote.
First you need to get it assigned to a name:
pathname <- 'C:\\Users\\mhermans\\somefile.csv'
Notice that in order to get it into a character vector you needed to double them all, which gives a hint about how you could use regex. Actually, if you read it in from a text file, R will do all the doubling for you. Mind you, it's not really doubling the backslashes: each one is stored as a single backslash, but it is displayed doubled and needs to be typed doubled at the console. Otherwise the R interpreter tries (and often fails) to turn it into a special character. To compound the problem, regex uses the backslash as an escape as well. So to detect a backslash with grep, sub, or gsub you need to quadruple the backslashes:
gsub("\\\\", "/", pathname)
# [1] "C:/Users/mhermans/somefile.csv"
You needed to doubly "double" the backslashes. The first of each pair of \'s signals to the regex engine that what comes next is a literal.
Consider:
nchar("\\A")
# returns `[1] 2`
If file E:\Data\junk.txt contains the following text (without quotes): C:\Users\mhermans\somefile.csv
You may get a warning with the following statement, but it will work:
texinp <- readLines("E:\\Data\\junk.txt")
If file E:\Data\junk.txt contains the following text (with quotes): "C:\Users\mhermans\somefile.csv"
The above readLines statement might also give you a warning, but texinp will now contain:
"\"C:\\Users\\mhermans\\somefile.csv\""
So, to get what you want, make sure there aren't quotes in the incoming file, and use:
texinp <- suppressWarnings(readLines("E:\\Data\\junk.txt"))
I am looking for some best practices as far as handling csv and tab delimited files.
For CSV files I am already doing some formatting if a value contains a comma or double quote but what if the value contains a new line character? Should I leave the new line intact and encase the value in double quotes + escape any double quotes within the value?
Same question for tab delimited files. I assume the answer would be very similar if not the same.
Usually you keep \n unaltered, relying on the fact that the newline character is enclosed in a double-quoted field. This doesn't create ambiguities, but it's really ugly if you have to look at the file in a normal text editor.
But that is how you should do it, since you don't escape anything inside a quoted field in a CSV except the double quote itself.
@Jack is right that your best bet is to keep the \n unaltered, since you'll expect it inside double quotes if that is the case.
As with most things, I think consistency here is key. As far as I know, your values only need to be double-quoted if they span multiple lines, contain commas, or contain double-quotes. In some implementations I've seen, all values are escaped and double-quoted, since it makes the parsing algorithm simpler (there's never a question of escaping and double-quoting, and the reverse on reading the CSV).
This isn't the most space-optimized solution, but makes reading and writing the file a trivial affair, for both your own library and others that may consume it in the future.
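For what it's worth, this is how Python's standard csv module handles the case. It is a sketch with an invented row, relying on the module's default settings (minimal quoting, doubled double quotes, \r\n line terminator).

```python
import csv
import io

row = ["id-1", "first line\nsecond line", 'He said "hi", then left']

buffer = io.StringIO()
csv.writer(buffer).writerow(row)
encoded = buffer.getvalue()
print(repr(encoded))

# The reader keeps consuming lines until the quoted field is closed,
# so the embedded newline comes back as part of the value.
decoded = next(csv.reader(io.StringIO(encoded)))
print(decoded)
```

Only the fields that need it are quoted here; passing quoting=csv.QUOTE_ALL to csv.writer gives the "quote everything" variant described above.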
For TSV, if you want lossless representation of values, the "Linear TSV" specification is worth considering: http://paulfitz.github.io/dataprotocols/linear-tsv/index.html
For obvious reasons, most such conventions adhere to the following at a minimum:
\n for newline,
\t for tab,
\r for carriage return,
\\ for backslash
Some tools add \0 for NUL.
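A minimal Python sketch of the four escapes listed above. It is not a full implementation of the Linear TSV specification, and the function names and the handling of unknown escape sequences are assumptions.

```python
ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r"}


def escape_field(value):
    """Replace backslash, tab, newline and carriage return with escapes."""
    return "".join(ESCAPES.get(ch, ch) for ch in value)


def unescape_field(field):
    """Decode in a single pass so that \\\\n is read as backslash + n."""
    out = []
    chars = iter(field)
    for ch in chars:
        if ch == "\\":
            code = next(chars, "")
            # Assumption: unknown sequences are kept verbatim.
            out.append(UNESCAPES.get(code, "\\" + code))
        else:
            out.append(ch)
    return "".join(out)


def encode_row(values):
    return "\t".join(escape_field(v) for v in values)


def decode_row(line):
    # Safe to split: escaped fields contain no raw tab characters.
    return [unescape_field(f) for f in line.split("\t")]


row = ["a\tb", "line1\nline2", "C:\\temp"]
line = encode_row(row)
print(line)
print(decode_row(line))
```

Decoding with chained replace calls is tempting but goes wrong on input such as a backslash followed by the letter n, which is why the decoder walks the string instead.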
If you were forced to use a single char on a split method, which char would be the most reliable?
Definition of reliable: a split character that is not part of the individual sub strings being split.
We currently use
public const char Separator = ((char)007);
I think this is the beep sound (ASCII BEL), if I am not mistaken.
Aside from 0x0, which may not be available (because of null-terminated strings, for example), the ASCII control characters between 0x1 and 0x1f are good candidates. The ASCII characters 0x1c-0x1f are even designed for such a thing and have the names File Separator, Group Separator, Record Separator, Unit Separator. However, they are forbidden in transport formats such as XML.
In that case, characters from the Unicode private-use code points may be used.
One last option would be to use an escaping strategy, so that the separation character can be entered somehow anyway. However, this complicates the task quite a lot and you cannot use String.Split anymore.
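A short Python sketch of the separator-character idea, assuming the data never legitimately contains 0x1e or 0x1f. The pack and unpack names are invented, and rejecting offending input is just one possible policy.

```python
UNIT_SEP = "\x1f"    # ASCII Unit Separator: between fields
RECORD_SEP = "\x1e"  # ASCII Record Separator: between records


def pack(records):
    """Join records and fields, refusing data that contains a separator."""
    for record in records:
        for field in record:
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError("field contains a separator character")
    return RECORD_SEP.join(UNIT_SEP.join(record) for record in records)


def unpack(data):
    return [record.split(UNIT_SEP) for record in data.split(RECORD_SEP)]


records = [["key1", "value::1"], ["key2", "a||b,c"]]
packed = pack(records)
print(repr(packed))
print(unpack(packed))
```

Without escaping, the check in pack is the only thing that keeps the round trip honest, which is the trade-off the answer describes.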
You can safely use whatever character you like as delimiter, if you escape the string so that you know that it doesn't contain that character.
Let's for example choose the character 'a' as delimiter. (I intentionally picked a usual character to show that any character can be used.)
Use the character 'b' as escape code. We replace any occurrence of 'a' with 'b1' and any occurrence of 'b' with 'b2':
private static string Escape(string s) {
    return s.Replace("b", "b2").Replace("a", "b1");
}
Now, the string doesn't contain any 'a' characters, so you can put several of those strings together:
string msg = Escape("banana") + "a" + Escape("aardvark") + "a" + Escape("bark");
The string now looks like this:
b2b1nb1nb1ab1b1rdvb1rkab2b1rk
Now you can split the string on 'a' and get the individual parts:
b2b1nb1nb1
b1b1rdvb1rk
b2b1rk
To decode the parts you do the replacement backwards:
private static string Unescape(string s) {
    return s.Replace("b1", "a").Replace("b2", "b");
}
So splitting the string and unencoding the parts is done like this:
string[] parts = msg.Split('a');
for (int i = 0; i < parts.Length; i++) {
    parts[i] = Unescape(parts[i]);
}
Or using LINQ:
string[] parts = msg.Split('a').Select<string,string>(Unescape).ToArray();
If you choose a less common character as delimiter, there are of course fewer occurrences that will be escaped. The point is that the method makes sure that the character is safe to use as delimiter without making any assumptions about what characters exists in the data that you want to put in the string.
I usually prefer a '|' symbol as the split character. If you are not sure what the user will enter, you can forbid certain special characters in the input and choose the split character from among them.
It depends what you're splitting.
In most cases it's best to use split chars that are fairly commonly used, for instance
value, value, value
value|value|value
key=value;key=value;
key:value;key:value;
You can use quoted identifiers nicely with commas:
"value", "value", "value with , inside", "value"
I tend to use , first, then |, then if I can't use either of them I use the section-break char §
Note that on Windows you can type characters from the OEM code page with ALT+number (on the numeric keypad only), so § is ALT+21
\0 is a good split character. It's pretty hard (impossible?) to enter from keyboard and it makes logical sense.
\n is another good candidate in some contexts.
And of course, .NET strings are Unicode, so there's no need to limit yourself to the first 255 characters. You can always use a rare Mongolian letter or some reserved or unused Unicode symbol.
There are overloads of String.Split that take string separators...
I'd personally say that it depends entirely on the situation; if you're writing a simple TCP/IP chat system, you obviously shouldn't use '\n' as the split character. But '\0' is a good character to use because the users can't ever enter it!
First of all, in C# (or .NET), you can use more than one split character in one split operation.
From the String.Split Method (Char[]) reference, the separator parameter is:
An array of Unicode characters that delimit the substrings in this instance, an empty array that contains no delimiters, or null reference (Nothing in Visual Basic).
In my opinion, there's no MOST reliable split character, however some are more suitable than others.
Popular split characters like tab, comma, and pipe are good for viewing the unsplit string/line.
If it's only for storing/processing, the safer characters are probably those that are seldom used or those not easily entered from the keyboard.
It also depends on the usage context. E.g. if you are expecting the data to contain email addresses, "@" is a no-no.
Say we were to pick one from the ASCII set. There are quite a number to choose from, e.g. " ` ", " ^ " and some of the non-printable characters. Do beware of some characters though; not all are suitable. E.g. 0x00 might have an adverse effect on some systems.
It depends very much on the context in which it's used. If you're talking about a very general delimiting character then I don't think there is a one-size-fits-all answer.
I find that the ASCII null character '\0' is often a good candidate, or you can go with nitzmahone's idea and use more than one character, then it can be as crazy as you want.
Alternatively, you can parse the input and escape any instances of your delimiting character.
The "|" pipe sign is mostly used when you are passing several arguments to a method that accepts just a single string parameter.
This is widely used in SQL Server stored procedures as well, where you need to pass an array as a parameter. But mostly it depends on the situation where you need it.
When people talk about string delimiters, does that include quotes or does that mean everything except quotes?
It means any character used to define the beginning and end of a string (e.g. quotes but, in other contexts, other characters).
There's a subtle difference. If you're talking about string delimiters, that nearly always means quotes, either " or '.
If you're talking about a delimited string, then you're normally talking about a string of tokens with delimiters between them, i.e.
"this,is,a,delimited,string"
It's very common to use a comma as the delimiter, but that leads to issues when a token already contains a comma, for instance
"one,million,dollars,$1,000,000"
In this instance it's common to further delimit the token so we get
"one,million,dollars,"$1,000,000""
Another common alternative is to use an unusual character as the delimiter, and there's a minor convention to use the pipe symbol |
"one|million|dollars|$1,000,000"