Why does m4 use two different characters for quotes?

In virtually every language quoting strings is straightforward - you put some stuff before a string and then the same stuff at the end of a string (maybe mirrored), for example:
"string"
'string'
R"(string)"
The m4 macro processor is different, though: strings are quoted using a backtick and a single quote, like this:
`string'
My question is: does this approach have any technical justification, or is it just an expression of the authors' creativity?

Quoting Wikipedia, it is related to controlling macro expansion in strings:
Unlike most languages, strings in m4 are quoted using the backtick (`)
as the starting delimiter, and apostrophe (') as the ending delimiter.
The use of separate starting and ending delimiters allows for the
arbitrary nesting of quotation marks in strings, allowing a fine
degree of control of how and when macro expansion takes place in
different parts of a string.
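To illustrate the nesting point, here is a minimal sketch in Python (not m4 itself). The function name strip_one_level is hypothetical, and macro expansion is ignored entirely; it only shows that with distinct opening and closing delimiters a scanner can track nesting depth with a simple counter.

```python
def strip_one_level(text: str, open_q: str = "`", close_q: str = "'") -> str:
    """Remove the outermost layer of quotes, keeping nested quotes intact.

    Illustrative only: real m4 also expands macros in unquoted text.
    """
    out = []
    depth = 0
    for ch in text:
        if ch == open_q:
            depth += 1
            if depth > 1:      # a nested opener is kept literally
                out.append(ch)
        elif ch == close_q and depth > 0:
            depth -= 1
            if depth > 0:      # a nested closer is kept literally
                out.append(ch)
        else:
            out.append(ch)     # ordinary text, or a stray closer outside quotes
    if depth != 0:
        raise ValueError("unbalanced quotes")
    return "".join(out)


print(strip_one_level("`outer `inner' text'"))  # prints: outer `inner' text
```

With a single symmetric delimiter such as ", the scanner could not tell whether the second " closes the string or opens a nested one, so this kind of depth counting would not be possible.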

Related

Why are "Here Strings" called "Here Strings"?

What is the reasoning behind the name?
SS64 explains here strings in PowerShell as follows:
A here string is a single-quoted or double-quoted string which can span multiple lines.
Expressions in single-quoted strings are not evaluated.
All the lines in a here-string are interpreted as strings, even though they are not enclosed in quotation marks.
$myHereString = @'
some text with "quotes" and variable names $printthis
some more text
'@
They have this name in PowerShell because it is borrowed from Unix style shells, like many other elements and concepts in PowerShell:
From Wikipedia:
Here documents originate in the Unix shell, and are found in sh, csh, ksh, bash and zsh, among others.
As for why that name was used originally, the article does not strictly cover etymology, but based on this description:
In computing, a here document (here-document, here-text, heredoc, hereis, here-string or here-script) is a file literal or input stream literal: it is a section of a source code file that is treated as if it were a separate file. The term is also used for a form of multiline string literals that use similar syntax, preserving line breaks and other whitespace (including indentation) in the text.
It seems logical that the "here" part of the name refers to a file being included "here" (as in, at this point).
Just for completeness, it is worth noting that PowerShell also supports newlines directly in both single and double quoted [string]s, like so:
$myString = 'some text with "quotes" and variable names $printthis
some more text'
A better example of where a here-string is useful is when you need both types of quotes:
$myHereString = @'
some text with "quotes" and variable names $printthis
some more text
and my grandma's soup
'@
If that were just a single quoted string, the ' in grandma's would need to be escaped.
If it were a double quoted string, you'd need to escape the instances of " and you'd need to escape the $ if you didn't want variable expansion.
Interesting aside: the differences between how here-strings are interpreted form the basis of this demonstration of writing a bash/powershell/win batch polyglot (a script whose contents can be executed in any of those environments).

Lua - How to remove quotes around integers in strings

So I have this string:
{"scores":{"1":["John",60],"2":["Jude",60],"3":["Max",60],"4":["Kyle",60],"5":["Smith",60],"6":["Mark",50],"7":["Luke",40],"8":["Anne",30],"9":["Bruce",20],"10":["kazuo",10]}}
There are a number of integers there that have quotes around them, and I want to get rid of them. How do I do that? I already tried out:
print(string.gsub(string, '/"(\d)"/', "%1"));
but it does not work. :(
Lua does not have regular expressions like Perl; instead, it has patterns. These are similar, with a few differences.
There is no need for the delimiting slashes / /, and the escape character is %, not \. Note also that a single digit class would not match multi-digit keys such as "10", hence %d+. Otherwise, your attempt is essentially correct:
print(string.gsub(str, '"(%d+)"', "%1"))
Where str is the variable containing the input string. Also note that string.gsub returns 2 values, which are both printed, the second result being the number of substitutions. Use an extra pair of parentheses to keep only the first result.
You can simplify the notation a little using the colon (:) operator:
print((str:gsub('"(%d+)"', "%1")))

Characters to separate value

I need to create a string to store key/value pairs, for example:
key1::value1||key2::value2||key3::value3
When deserializing it, I may encounter an error if a key or a value happens to contain || or ::
What are common techniques to deal with such a situation? Thanks.
A common way to deal with this is called an escape character or qualifier. Consider this Comma-Separated line:
Name,City,State
John Doe, Jr.,Anytown,CA
Because the name field contains a comma, the line is split into four fields instead of three.
If you enclose each data value in qualifiers, the parser knows when to ignore the delimiter, as in this example:
Name,City,State
"John Doe, Jr.",Anytown,CA
Qualifiers can be optional, used only on data fields that need it. Many implementations will use qualifiers on every field, needed or not.
You may want to implement something similar for your data encoding.
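As a rough illustration of optional versus always-on qualifiers, here is a small sketch using Python's standard csv module. The choice of Python and the variable names are mine, not part of the answer above.

```python
import csv
import io

rows = [
    ["Name", "City", "State"],
    ["John Doe, Jr.", "Anytown", "CA"],
]

buf = io.StringIO()
# QUOTE_MINIMAL (the default) qualifies only the fields that need it.
csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n").writerows(rows)
minimal = buf.getvalue()

buf = io.StringIO()
# QUOTE_ALL qualifies every field, needed or not.
csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(rows)
quote_all = buf.getvalue()

print(minimal)
print(quote_all)

# Reading the qualified text back yields the original three fields per row.
parsed = list(csv.reader(io.StringIO(minimal)))
```

In the first output only "John Doe, Jr." is wrapped in quotes; in the second, every field is.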
Escape || when serializing, and unescape it when deserializing. A common C-like way to escape is to prepend \. For example:
{ "a:b:c": "foo||bar", "asdf": "\\|||x||||:" }
serialize => "a\:b\:c:foo\|\|bar||asdf:\\\\\|\|\|x\|\|\|\|\:"
Note that \ needs to be escaped (and double escaped due to being placed in a C-style string).
If we assume that you have total control over the input string, then the common way of dealing with this problem is to use an escape character.
Typically, the backslash (\) is used as the escape character, meaning "the next character is literal data, not a delimiter". So the parser would see || and :: as delimiters, but would read \|\| as two pipe characters || belonging to the key or the value.
The next problem is that the backslash is now overloaded: how do you represent a literal backslash? This is solved by escaping the backslash as well, so to represent a \ you write \\, and the parser reads \\ as \.
Note that if you use escape characters, you can use a single character for the delimiters, which might make things simpler.
Alternatively, you may have to restrict the input and say that || and :: are simply banned, then either fail or remove them when the string is encoded.
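The backslash-escape approach could be sketched like this in Python. This is only one possible implementation: the function names (escape, serialize, deserialize) are hypothetical, and it assumes the :: and || delimiters from the question, with every :, | and \ inside the data escaped individually.

```python
ESC = "\\"
KV_SEP = "::"
PAIR_SEP = "||"
SPECIAL = {ESC, ":", "|"}


def escape(s):
    """Prefix every special character in the data with the escape character."""
    return "".join(ESC + ch if ch in SPECIAL else ch for ch in s)


def serialize(pairs):
    return PAIR_SEP.join(escape(k) + KV_SEP + escape(v) for k, v in pairs.items())


def deserialize(text):
    pairs = {}
    if not text:
        return pairs
    key = None          # None while we are still reading the key
    buf = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == ESC:
            if i + 1 >= len(text):
                raise ValueError("dangling escape character")
            buf.append(text[i + 1])     # escaped character is literal data
            i += 2
        elif key is None and text.startswith(KV_SEP, i):
            key, buf = "".join(buf), []
            i += len(KV_SEP)
        elif key is not None and text.startswith(PAIR_SEP, i):
            pairs[key] = "".join(buf)
            key, buf = None, []
            i += len(PAIR_SEP)
        elif ch in ":|":
            raise ValueError(f"unescaped delimiter character at index {i}")
        else:
            buf.append(ch)
            i += 1
    if key is None:
        raise ValueError("missing key/value separator")
    pairs[key] = "".join(buf)
    return pairs


data = {"a::b": "foo||bar", "back\\slash": "x|y:z"}
encoded = serialize(data)
print(encoded)
print(deserialize(encoded) == data)  # prints: True
```

Because every special character in the data is escaped, any unescaped : or | seen by the parser must belong to a delimiter, which is what makes the round trip unambiguous.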
A simple solution is to escape a separator (with a backslash, for instance) any time it occurs in data:
Name,City,State
John Doe\, Jr.,Anytown,CA
Of course, the escape character itself will also need to be escaped when it occurs in data; in this case, a backslash would become \\.
You can use a non-printable character as the separator (e.g. vertical tab :-) ).
You can escape separator character in your data during serialization. For example: if you use one character as separator (key1:value1|key2:value2|...) and your data is:
this:is:key1 this|is|data1
this:is:key2 this|is|data2
you double every colon and pipe character in your data when you serialize it. So you will get:
this::is::key1:this||is||data1|this::is::key2:this||is||data2|...
During deserialization, whenever you come across two colons or two pipe characters, you know that this is not your separator but part of your data, and that you have to change it to one character. On the other hand, every single colon or pipe character is your separator.
Use a prefix (say "a") for your special characters (say "b") present in the key and values to store them. This is called escaping.
Then decode the key and values by simply replacing any "ab" sequence with "b". Bear in mind that the prefix is also a special character. An example:
Prefix: \
Special characters: :, |, \
Encoded:
title:Slashdot\: News for Nerds. Stuff that Matters.|shortTitle:\\.
Decoded:
title=Slashdot: News for Nerds. Stuff that Matters.
shortTitle=\.
The common technique is escaping reserved characters, for example:
In URLs, you escape some characters using a %HEX representation:
http://example.com?aa=a%20b
In programming languages, you escape some characters with a backslash prefix:
"\"hello\""

New lines in tab-delimited or comma-delimited output

I am looking for some best practices as far as handling csv and tab delimited files.
For CSV files I am already doing some formatting if a value contains a comma or double quote but what if the value contains a new line character? Should I leave the new line intact and encase the value in double quotes + escape any double quotes within the value?
Same question for tab delimited files. I assume the answer would be very similar if not the same.
Usually you keep \n unaltered, relying on the fact that the newline character will be enclosed in a double-quoted ("...") string. This doesn't create ambiguities, but it's really ugly if you have to look at the file in a normal text editor.
But that is how you should do it, since you don't escape anything inside a quoted CSV field except the double quote itself.
@Jack is right that your best bet is to keep the \n unaltered, since you'll expect it inside double quotes in that case.
As with most things, I think consistency here is key. As far as I know, your values only need to be double-quoted if they span multiple lines, contain commas, or contain double-quotes. In some implementations I've seen, all values are escaped and double-quoted, since it makes the parsing algorithm simpler (there's never a question of escaping and double-quoting, and the reverse on reading the CSV).
This isn't the most space-optimized solution, but makes reading and writing the file a trivial affair, for both your own library and others that may consume it in the future.
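For what it's worth, here is a small sketch of that behaviour using Python's standard csv module (my choice of tool, not something the answers above prescribe). The newline is left intact inside the quoted field, and embedded double quotes are doubled.

```python
import csv
import io

row = ["id-1", 'He said "hi"\nthen left', "a,b"]

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
text = buf.getvalue()
print(text)

# The reader joins the physical lines back into a single record because
# the newline sits inside a double-quoted field.
parsed = next(csv.reader(io.StringIO(text)))
print(parsed == row)  # prints: True

# The same quoting rules can be applied to tab-delimited output.
tsv_buf = io.StringIO()
csv.writer(tsv_buf, delimiter="\t", lineterminator="\n").writerow(["a\tb", "line1\nline2"])
tsv_text = tsv_buf.getvalue()
```

When reading from a real file rather than a string, the csv documentation recommends opening it with newline="" so that embedded newlines are passed through untranslated.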
For TSV, if you want lossless representation of values, the "Linear TSV" specification is worth considering: http://paulfitz.github.io/dataprotocols/linear-tsv/index.html
For obvious reasons, most such conventions adhere to the following at a minimum:
\n for newline,
\t for tab,
\r for carriage return,
\\ for backslash
Some tools add \0 for NUL.
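A minimal sketch of this kind of escaping in Python follows. It covers only the four escapes listed above and is not a full implementation of the Linear TSV specification; the function names are hypothetical.

```python
_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\t": "\\t", "\r": "\\r"}
_UNESCAPES = {"\\": "\\", "n": "\n", "t": "\t", "r": "\r"}


def encode_field(value):
    """Replace backslash, newline, tab and carriage return with escapes."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def decode_field(field):
    out = []
    i = 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field) and field[i + 1] in _UNESCAPES:
            out.append(_UNESCAPES[field[i + 1]])
            i += 2
        else:
            out.append(ch)   # ordinary character (or an unknown escape, kept as-is)
            i += 1
    return "".join(out)


def encode_row(values):
    return "\t".join(encode_field(v) for v in values)


def decode_row(line):
    return [decode_field(f) for f in line.rstrip("\n").split("\t")]


row = ["a\tb", "line1\nline2", "back\\slash", "cr\rhere"]
line = encode_row(row)
print(line)
print(decode_row(line) == row)  # prints: True
```

Since an encoded row never contains a raw tab or newline inside a field, the file can be split on newlines and tabs without any quoting logic.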

Are quotes a type of string delimiter? Or does 'delimiter' mean other types of characters not including quotes?

When people talk about string delimiters, does that include quotes or does that mean everything except quotes?
It means any character used to define the beginning and end of a string (e.g. quotes but, in other contexts, other characters).
There's a subtle difference. If you're talking about string delimiters, that nearly always means quotes, either " or '.
If you're talking about a delimited string, then you're normally talking about a string of tokens with delimiters between them, i.e.
"this,is,a,delimited,string"
It's very common to use a comma as the delimiter, but that leads to issues when a token already contains a comma, for instance:
"one,million,dollars,$1,000,000"
In this instance it's common to further delimit the token so we get
"one,million,dollars,"$1,000,000""
Another common alternative is to use an unusual character as the delimiter, and there's a minor convention of using the pipe symbol |:
"one|million|dollars|$1,000,000"

Resources