New lines in tab delimited or comma delimited output - text

I am looking for some best practices as far as handling csv and tab delimited files.
For CSV files I am already doing some formatting if a value contains a comma or double quote but what if the value contains a new line character? Should I leave the new line intact and encase the value in double quotes + escape any double quotes within the value?
Same question for tab delimited files. I assume the answer would be very similar if not the same.

Usually you keep \n unaltered, relying on the fact that the newline character will be enclosed in a " " string. This doesn't create ambiguities, but it's really ugly if you have to look at the file in a normal text editor.
But that is how you should do it, since you don't escape anything inside a string in a CSV except for the double quote itself.

@Jack is right that your best bet is to keep the \n unaltered, since a parser will expect it inside double quotes in that case.
As with most things, I think consistency here is key. As far as I know, your values only need to be double-quoted if they span multiple lines, contain commas, or contain double-quotes. In some implementations I've seen, all values are escaped and double-quoted, since it makes the parsing algorithm simpler (there's never a question of escaping and double-quoting, and the reverse on reading the CSV).
This isn't the most space-optimized solution, but makes reading and writing the file a trivial affair, for both your own library and others that may consume it in the future.
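As a minimal sketch of the advice above, assuming Python and its standard csv module (the helper names to_csv and from_csv are made up for illustration), a value containing a comma, double quotes, and a newline can be written and read back unchanged:

```python
import csv
import io


def to_csv(rows):
    """Write rows to CSV text, quoting only the fields that need it."""
    buf = io.StringIO(newline="")  # newline="" leaves line endings untranslated
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
    return buf.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline="")))


rows = [["id", "comment"], ["1", 'He said "hi",\nthen left']]
text = to_csv(rows)

# The embedded newline stays inside the quoted field and the inner
# double quotes are doubled, so the reader recovers the original value.
assert from_csv(text) == rows
```

Passing csv.QUOTE_ALL instead would quote every field, which matches the "quote everything" approach mentioned above.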

For TSV, if you want lossless representation of values, the "Linear TSV" specification is worth considering: http://paulfitz.github.io/dataprotocols/linear-tsv/index.html
For obvious reasons, most such conventions adhere to the following at a minimum:
\n for newline,
\t for tab,
\r for carriage return,
\\ for backslash
Some tools add \0 for NUL.
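A rough sketch of the four escapes listed above, in Python. This is not a complete implementation of the Linear TSV specification, and the function names are hypothetical. How unknown escapes and a trailing lone backslash are handled is an assumption of this sketch:

```python
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
_UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r"}


def escape_field(value):
    """Replace backslash, tab, newline and carriage return with escapes."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def unescape_field(text):
    """Reverse escape_field, scanning left to right."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        nxt = next(chars, None)
        if nxt is None:
            out.append("\\")  # assumption: keep a trailing lone backslash
        else:
            # assumption: unknown escapes are passed through untouched
            out.append(_UNESCAPES.get(nxt, "\\" + nxt))
    return "".join(out)


def encode_row(fields):
    """Join escaped fields with tabs; the result contains no raw newline."""
    return "\t".join(escape_field(f) for f in fields)


def decode_row(line):
    """Split on tabs (safe, since data tabs are escaped) and unescape."""
    return [unescape_field(f) for f in line.split("\t")]


row = ["a\tb", "line1\nline2", "C:\\temp"]
assert decode_row(encode_row(row)) == row
```

Because every tab and newline in the data is escaped, each record occupies exactly one physical line and a plain split on tab is enough to find the fields.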

How to break a string over multiple lines and preserve spaces in YAML?

Please note that this question is similar to this one, but different enough that those answers won't solve my problem:
For insertion of control characters like e.g. \x08, it seems that I have to use double quotes ".
All spaces need to be preserved exactly as given. For line breaks I explicitly use \n.
I have some string data which I need to store in YAML, e.g.:
" This is my quite long string data "
"This is my quite long string data"
"This_is_my_quite_long_string_data"
"Sting data\nwhich\x08contains control characters"
and need it in YAML as something like this:
Key: " This is my" +
" quite long " +
" string data "
This is no problem as long as I stay on a single line, but I don't know how to split the string content across multiple lines.
YAML block scalar styles (>, |) won't help here, because they don't allow escaping and they even do some whitespace stripping and newline/space substitution, which is useless in my case.
It looks like the only way is to use double quotes " and backslashes \, like this:
Key: "\
This is \
my quite \
long string data\
"
Trying this in an online YAML parser results in "This is my quite long string data" as expected.
But it unfortunately fails if one of the "sub-lines" has a leading space, like this:
Key: "\
This is \
my quite\
 long st\
ring data\
"
This results in "This is my quitelong string data": the space between the words quite and long is removed. The only thing that comes to mind to solve that is to replace the first leading space of each sub-line with \x20, like this:
Key: "\
This is \
my quite\
\x20long st\
ring data\
"
As I chose YAML to get the most human-readable format possible, I find \x20 a bit of an ugly solution. Maybe someone knows a better approach?
To keep it human readable, I also don't want to use !!binary for this.
Instead of \x20, you can simply escape the first non-indentation space on the line:
Key: "\
This is \
my quite\
\ long st\
ring data\
"
This works with multiple spaces too; you only need to escape the first one.
You are right in your observation that control characters can only be represented in double quoted scalars.
However the parser doesn't fail if the sub-lines (in YAML speak: continuation lines) have a leading space. It is your interpretation of the YAML standard that is incorrect. The standard explicitly states that for multi-line double quoted scalars:
All leading and trailing white space characters are excluded from the content.
So you can put as many spaces as you want before long; it will not make a difference.
The representer for double quoted scalars for Python (both in ruamel.yaml and PyYAML) always does represent newlines as \n. I am not aware of YAML representers in other languages where you have more control over this (and e.g. get double newlines to represent \n in your double quoted scalars). So you probably have to write your own representer.
While writing a representer you can try to make the line breaking be smart, in that it minimizes the number of escaped spaces (by putting them between words on the same line). But especially on strings with a high double space to word ratio, combined with a small width to operate in, it will be hard (if not impossible) to do without escaped spaces.
Such a representer should IMO first check if double quoting is necessary (i.e. there are control characters apart from newlines). If not, and there are newlines, you are probably better off representing the string as a block style literal scalar (for which spaces at the beginning or end of a line are not excluded).

vim search and replace between number

I have a pattern where there are double-quotes between numbers in a CSV file.
I can search for the pattern with [0-9]\"[0-9], but how do I retain the digits while removing the double quote? The CSV format is like this:
"1234"5678","Text1","Text2"
"987654321","Text3","text4"
"7812891"3","Text5","Text6"
As you may notice there are double quotes between some numbers which I want to remove.
I have tried the following way, which is incorrect:
:%s/[0-9]\"[0-9]/[0-9][0-9]/g
Is it possible to execute a command at every match of the search pattern, maybe go one character forward and delete it? How can "lx" be embedded in search and replace?
You need to capture groups. Try:
:%s/\(\d\)"\(\d\)/\1\2/g
[A digit can also be denoted by \d.]
I know that this question has been answered already, but here's another approach:
:%s/\d\zs"\ze\d
Explanation:
%s    substitute over the whole buffer
\d    match a digit
\zs   set the start of the match here
"     match a double quote
\ze   set the end of the match here
\d    match a digit
That makes the substitute command to match only the double-quote surrounded by digits.
Omitting the replacement string just deletes the match.
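For anyone doing the same clean-up outside Vim, here is a small sketch of the same idea in Python, where lookbehind and lookahead play the role of \zs and \ze. The function name is made up for illustration:

```python
import re

# A double quote with a digit immediately before and after it.
# The lookarounds check the digits without making them part of the match.
QUOTE_BETWEEN_DIGITS = re.compile(r'(?<=\d)"(?=\d)')


def strip_inner_quotes(line):
    """Delete every double quote that sits directly between two digits."""
    return QUOTE_BETWEEN_DIGITS.sub("", line)


fixed = strip_inner_quotes('"1234"5678","Text1","Text2"')
assert fixed == '"12345678","Text1","Text2"'
```

Since only the quote itself is matched, the replacement can be the empty string and the surrounding digits are left alone.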
You need to use groups in the regular expression.
Try this:
:%s/\([0-9]\)"\([0-9]\)/\1\2/g
A somewhat naive solution:
%s/^"/BEGINNING OF LINE QUOTE MARK/g
%s/\",\"/quote comma quote/g
%s/\"$/quota end of line/g
%s/\"//g
%s/quota end of line/"/g
%s/quote comma quote/","/g
%s/BEGINNING OF LINE QUOTE MARK/"/g
A macro can easily be created from it and invoked as many times as needed.

Characters to separate value

I need to create a string to store key/value pairs, for example:
key1::value1||key2::value2||key3::value3
When deserializing it, I may encounter an error if the key or the value happens to contain || or ::
What are common techniques to deal with such a situation? Thanks.
A common way to deal with this is called an escape character or qualifier. Consider this Comma-Separated line:
Name,City,State
John Doe, Jr.,Anytown,CA
Because the name field contains a comma, the line gets split improperly.
If you enclose each data value by qualifiers, the parser knows when to ignore the delimiter, as in this example:
Name,City,State
"John Doe, Jr.",Anytown,CA
Qualifiers can be optional, used only on data fields that need it. Many implementations will use qualifiers on every field, needed or not.
You may want to implement something similar for your data encoding.
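To illustrate the two qualifier policies described above, here is a short sketch using Python's standard csv module. The helper name format_row is hypothetical, and lineterminator is set to "\n" only to keep the output easy to read:

```python
import csv
import io


def format_row(fields, quoting):
    """Render one CSV line using the given quoting policy."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerow(fields)
    return buf.getvalue()


record = ["John Doe, Jr.", "Anytown", "CA"]

# Qualifiers only where the field needs them (it contains the delimiter).
minimal = format_row(record, csv.QUOTE_MINIMAL)
assert minimal == '"John Doe, Jr.",Anytown,CA\n'

# Qualifiers on every field, needed or not.
everything = format_row(record, csv.QUOTE_ALL)
assert everything == '"John Doe, Jr.","Anytown","CA"\n'
```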
Escape || when serializing, and unescape it when deserializing. A common C-like way to escape is to prepend \. For example:
{ "a:b:c": "foo||bar", "asdf": "\\|||x||||:" }
serialize => "a\:b\:c:foo\|\|bar||asdf:\\\\\|\|\|x\|\|\|\|\:"
Note that \ needs to be escaped (and double escaped due to being placed in a C-style string).
If we assume that you have total control over the input string, then the common way of dealing with this problem is to use an escape character.
Typically, the backslash-\ character is used as an escape to say that "the next character is a special character", so in this case it should not be used as a delimiter. So the parser would see || and :: as delimiters, but would see \|\| as two pipe characters || in either the key or the value.
The next problem is that we have overloaded the backslash. The problem is then, "how do I represent a backslash". This is solved by saying that the backslash is also escaped, so to represent a \, you would have to say \\. So the parser would see \\ as \.
Note that if you use escape characters, you can use a single character for the delimiters, which might make things simpler.
Alternatively, you may have to restrict the input and say that || and :: are simply banned, and either fail or remove them when the string is encoded.
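The escape-character approach could look roughly like this in Python. It is only a sketch under a few assumptions: single-character delimiters (| between pairs, : between key and value) instead of || and ::, backslash as the escape character, and hypothetical function names throughout:

```python
ESCAPE = "\\"
PAIR_SEP = "|"
KV_SEP = ":"
SPECIAL = (ESCAPE, PAIR_SEP, KV_SEP)


def escape(text):
    """Prefix every special character with the escape character."""
    return "".join(ESCAPE + ch if ch in SPECIAL else ch for ch in text)


def unescape(text):
    """Drop each escape character and keep the character that follows it."""
    chars = iter(text)
    return "".join(next(chars, "") if ch == ESCAPE else ch for ch in chars)


def split_unescaped(text, sep):
    """Split on sep, ignoring any sep that is preceded by an escape."""
    parts, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == ESCAPE:
            # Copy the escape and the escaped character through untouched.
            current.append(ch + next(chars, ""))
        elif ch == sep:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def serialize(pairs):
    return PAIR_SEP.join(escape(k) + KV_SEP + escape(v) for k, v in pairs.items())


def deserialize(text):
    result = {}
    if not text:
        return result
    for item in split_unescaped(text, PAIR_SEP):
        # Raises ValueError if an item lacks exactly one unescaped ':'.
        key, value = split_unescaped(item, KV_SEP)
        result[unescape(key)] = unescape(value)
    return result


data = {"a:b:c": "foo||bar", "asdf": "\\|||x||||:"}
assert deserialize(serialize(data)) == data
```

Splitting has to happen before unescaping; otherwise an escaped delimiter inside the data would be indistinguishable from a real one.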
A simple solution is to escape a separator (with a backslash, for instance) any time it occurs in data:
Name,City,State
John Doe\, Jr.,Anytown,CA
Of course, the escape character itself will need to be escaped when it occurs in data as well; in this case, a backslash would become \\.
You can use a non-printable character as the separator (e.g. vertical tab :-) ).
You can escape the separator character in your data during serialization. For example: if you use one character as the separator (key1:value1|key2:value2|...) and your data is:
this:is:key1 this|is|data1
this:is:key2 this|is|data2
you double every colon and pipe character in your data when you serialize it. So you will get:
this::is::key1:this||is||data1|this::is::key2:this||is||data2|...
During deserialization, whenever you come across two colon or two pipe characters, you know that this is not your separator but part of your data, and that you have to change it to one character. On the other hand, every single colon or pipe character is your separator.
Use a prefix (say "a") for your special characters (say "b") present in the key and values to store them. This is called escaping.
Then decode the key and values by simply replacing any "ab" sequence with "b". Bear in mind that the prefix is also a special character. An example:
Prefix: \
Special characters: :, |, \
Encoded:
title:Slashdot\: News for Nerds. Stuff that Matters.|shortTitle:\\.
Decoded:
title=Slashdot: News for Nerds. Stuff that Matters.
shortTitle=\.
The common technique is escaping reserved characters, for example:
In URLs you escape some characters using %HEX representation:
http://example.com?aa=a%20b
In programming languages you escape some characters with a backslash prefix:
"\"hello\""

When is it acceptable to not trim a user input string?

Can someone give me a real-world scenario of a method/function with a string argument which came from user input (e.g. form field, parsed data from file, etc.) where leading or trailing spaces SHOULD NOT have been trimmed?
I can't ever recall such a situation for myself.
EDIT: Mind you, I didn't say trimming any whitespace. I said trimming leading or trailing (only) spaces (or whitespace).
Search string in any "Find" dialog in an editor.
Password input boxes. There's lots of data out there where whitespace can genuinely be considered an important part of the string. Restricting it to leading and trailing whitespace narrows things down a lot, but there are still many examples, such as text you pass through a PHP-style nl2br function.
If you are inputting code. There may be a scenario where whitespace at the beginning and end is necessary.
Also, look at Stack Overflow's markdown editor. Code examples are indented. If you posted just a code example, then it will require leading and trailing white space not be trimmed.
Perhaps a Whitespace interpreter.
Python....
A Stackoverflow answer, or more generally input written in markdown (four leading spaces -> code block).
A paragraph entry.
If the input is python code (say, for a pastebin kinda thing), you certainly can't trim leading white space; but you also can't trim trailing white space, because it could be a part of a multi-line string (triple quoted string).
I've used whitespace as a delimiter before, so there. Also, for anything that involves concatenating multiple inputs, removing leading/trailing whitespace can break formatting or possibly do worse. Aside from that, as Spencer said, for indented paragraphs you probably would not want to remove the leading whitespace.
Obviously passwords should not be trimmed. Passwords can contain leading or trailing whitespace that needs to be treated as valid characters.

Are quotes a type of string delimiter? Or does 'delimiter' mean other types of characters not including quotes?

When people talk about string delimiters, does that include quotes or does that mean everything except quotes?
It means any character used to define the beginning and end of a string (e.g. quotes but, in other contexts, other characters).
There's a subtle difference. If you're talking about string delimiters, that nearly always means quotes, either " or '.
If you're talking about a delimited string, then you're normally talking about a string of tokens with delimiters between them, i.e.
"this,is,a,delimited,string"
It's very common to use a comma as the delimiter, but that leads to issues when a token already contains a comma, for instance
"one,million,dollars,$1,000,000"
In this instance it's common to further delimit the token so we get
"one,million,dollars,"$1,000,000""
Another common alternative is to use an unusual character as the delimiter, and there's a minor convention to use the pipe symbol |
"one|million|dollars|$1,000,000"
