Summary: '{key:spec}'.format_map(dic) formats the value from dic accessed by key, with spec saying how it should be formatted. However, what if I want the separating colon to be part of the key? How do I tell Python that the colon is not a separator and that the characters after it are not a format specification?
Details: I use string templates for transforming XML attributes into other text. Say I have the attributes of an XML element in the attributes dictionary, and one of them has the key 'xlink:href' (the literal name of the attribute). When using the .format_map() method, how should the format string be written?
'{xlink:href}'.format_map(attributes) does not work: Python complains KeyError: 'xlink'. (The href would probably be treated as a bad format specification, but the exception stops processing before that.)
There is no way to escape the colon in {xlink:href}.
You can't specify arbitrary keys in the replacement field:
replacement_field ::= "{" [field_name] ["!" conversion] [":" format_spec] "}"
field_name ::= arg_name ("." attribute_name | "[" element_index "]")*
arg_name ::= [identifier | integer]
attribute_name ::= identifier
element_index ::= integer | index_string
index_string ::= <any source character except "]"> +
conversion ::= "r" | "s" | "a"
format_spec ::= <described in the next section>
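A workaround does follow from this grammar: an element_index may contain any character except "]", so a key containing a colon can be reached by indexing into the mapping rather than naming it directly. A minimal sketch (the argument name a and the URL are illustrative only):
attributes = {"xlink:href": "http://example.org/image.svg"}
# Inside [...] the colon is part of the index string, not a spec separator,
# so the whole attribute name is used as the dictionary key.
print("{a[xlink:href]}".format(a=attributes))
# http://example.org/image.svg
# The same idea works with format_map by wrapping the mapping:
print("{a[xlink:href]}".format_map({"a": attributes}))
# A format spec can still follow the closing bracket:
print("{a[xlink:href]:>40}".format(a=attributes))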
I need to match a token that is a combination of two parts:
"string" + any number; e.g. string64, string128, etc.
In the lexer rules I have
STRING: S T R I N G;
NUMERIC_LITERAL:
((DIGIT+ ('.' DIGIT*)?) | ('.' DIGIT+)) (E [-+]? DIGIT+)?
| '0x' HEX_DIGIT+;
In the parser, I defined
type_id_string: STRING NUMERIC_LITERAL;
However, the parser does not match and stops, expecting a STRING token.
How do I tell the parser that the token has two parts?
You probably have some "identifier" rule like this:
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
which causes input like string64 to be tokenized as a single ID token rather than as STRING and NUMERIC_LITERAL tokens.
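Not ANTLR itself, but a toy Python sketch of this longest-match rule may make it concrete (patterns mirror the question's grammar, with STRING assumed case-insensitive; the loop is a simplification of what a real lexer does):
import re
# At each position the rule with the longest match wins;
# on a tie, the rule defined first wins.
RULES = [
    ("STRING", re.compile(r"string", re.IGNORECASE)),
    ("NUMERIC_LITERAL", re.compile(r"[0-9]+")),
    ("ID", re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")),
]
def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        candidates = [(name, rx.match(text, pos)) for name, rx in RULES]
        candidates = [(name, m) for name, m in candidates if m]
        name, m = max(candidates, key=lambda c: len(c[1].group()))
        tokens.append((name, m.group()))
        pos = m.end()
    return tokens
print(tokenize("string64"))  # [('ID', 'string64')] -- one ID token, not two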
Also, trying to match these sorts of things in a parser rule like:
type_id_string: STRING NUMERIC_LITERAL;
will go wrong when you're discarding whitespace in the lexer. The input "string 64" (string + space + 64) could then still be matched by type_id_string, which is probably not what you want.
Either do:
type_id_string
: ID
;
or define these tokens in the lexer:
type_id_string
: TYPE_ID_STRING
;
// Important to match this before the `ID` rule!
TYPE_ID_STRING
: [a-zA-Z]+ [0-9]+
;
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
However, when doing that, input like fubar1 will also become a TYPE_ID_STRING and not an ID!
In the Rust Reference I see that the definition of MacroMatch is the following:
MacroMatch :
Token except $ and delimiters
| MacroMatcher
| $ ( IDENTIFIER_OR_KEYWORD except crate | RAW_IDENTIFIER | _ ) : MacroFragSpec
| $ ( MacroMatch+ ) MacroRepSep? MacroRepOp
MacroFragSpec :
block | expr | ident | item | lifetime | literal
| meta | pat | pat_param | path | stmt | tt | ty | vis
MacroRepSep :
Token except delimiters and MacroRepOp
MacroRepOp :
* | + | ?
According to the definition of tokens, I found that >> is a token. So, in my understanding, we can use any token as MacroRepSep except the delimiters ()/[]/{} and the repetition operators */+/?.
However, the following code doesn't compile, failing with the error "$a:expr is followed by >>, which is not allowed for expr fragments":
macro_rules! add_list2 {
($($a:expr)>>*) => {
0
$(+$a)*
}
}
pub fn main() {
println!("{}", add_list!(1>>2>>3));
}
I wonder why this is, and whether I can use a separator other than ,.
Not sure if you're using a different Rust version, but with your code on the current compiler (1.62) it outputs an error that includes what separators are available:
error: `$a:expr` is followed by `>>`, which is not allowed for `expr` fragments
--> src/main.rs:2:16
|
2 | ($($a:expr)>>*) => {
| ^^ not allowed after `expr` fragments
|
= note: allowed there are: `=>`, `,` or `;`
The problem with repetitions on exprs is that, since expressions are so varied, they can easily be ambiguous now or become ambiguous in a future version of the language. I'll quote the section on Follow-set Ambiguity Restrictions:
The parser used by the macro system is reasonably powerful, but it is limited in order to prevent ambiguity in current or future versions of the language. In particular, in addition to the rule about ambiguous expansions, a nonterminal matched by a metavariable must be followed by a token which has been decided can be safely used after that kind of match.
As an example, a macro matcher like $i:expr [ , ] could in theory be accepted in Rust today, since [,] cannot be part of a legal expression and therefore the parse would always be unambiguous. However, because [ can start trailing expressions, [ is not a character which can safely be ruled out as coming after an expression. If [,] were accepted in a later version of Rust, this matcher would become ambiguous or would misparse, breaking working code. Matchers like $i:expr, or $i:expr; would be legal, however, because , and ; are legal expression separators.
And it includes the separators available for various fragment specifiers:
expr and stmt may only be followed by one of: =>, ,, or ;.
pat_param may only be followed by one of: =>, ,, =, |, if, or in.
pat may only be followed by one of: =>, ,, =, if, or in.
path and ty may only be followed by one of: =>, ,, =, |, ;, :, >, >>, [, {, as, where, or a macro variable of block fragment specifier.
vis may only be followed by one of: ,, an identifier other than a non-raw priv, any token that can begin a type, or a metavariable with a ident, ty, or path fragment specifier.
All other fragment specifiers have no restrictions.
first="harry"
last="potter"
print(first, first.title())
print(f"Full name: {first.title()} {last.title()}")
print("Full name: {0.title()} {1.title()}".format(first, last))
The first two print statements work fine, which means the 'str' object does have a title() method.
The third print statement raises an error. Why is that?
The str.format() syntax is different from f-string syntax. In particular, while f-strings essentially let you put any expression between the brackets, str.format() is considerably more limited. Per the documentation:
The grammar for a replacement field is as follows:
replacement_field ::= "{" [field_name] ["!" conversion] [":" format_spec] "}"
field_name ::= arg_name ("." attribute_name | "[" element_index "]")*
arg_name ::= [identifier | digit+]
attribute_name ::= identifier
element_index ::= digit+ | index_string
index_string ::= <any source character except "]"> +
conversion ::= "r" | "s" | "a"
format_spec ::= <described in the next section>
You'll note that, while attribute names (via the dot operator .) and indices (via square-brackets []) - in other words, values - are valid, actual method calls (or any other expressions) are not. I hypothesize this is because str.format() does not actually execute the text, but just swaps in an object that already exists.
Actual f-strings (your second example) share a similar syntax to the str.format() method, in that they use curly-brackets {} to indicate the areas to replace, but according to the PEP that introduced them,
F-strings provide a way to embed expressions inside string literals, using a minimal syntax. It should be noted that an f-string is really an expression evaluated at run time, not a constant value.
This is clearly different from (and more complex than) str.format(), which is closer to a simple text replacement: an f-string is an expression and is evaluated as such, and it allows full expressions inside its brackets (in fact, you can even nest f-strings inside each other, which is fun).
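As a small illustration of that last point, here is a sketch of one f-string nested inside another (the values are arbitrary):
value = 3.14159
# The inner f-string is evaluated first ('3.14'), then the outer one
# right-aligns that result in a field of width 10.
print(f"{f'{value:.2f}':>10}")  # '      3.14'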
str.format() substitutes the string object into the corresponding placeholder, and by using '.' you can access the object's attributes. That is why {0.title()} looks for an attribute literally named title() on the string and finds nothing.
But if you use
print("Full name: {0.title} {1.title}".format(first, last))
>> Full name: <built-in method title of str object at 0x7f5e42d09630> <built-in method title of str object at 0x7f5e42d096b0>
Here you can see that you can access the built-in method of the string, but not call it.
If you want to use title() with format(), then use it like this:
print("Full name: {0} {1}".format(first.title(), last.title()))
>> Full name: Harry Potter
I'm trying to parse an existing language in ANTLR that's currently being parsed using the Ruby library Parslet.
Here is a stripped down version of my grammar:
grammar FilterMin;
filter : condition_set;
condition_set: condition_set_type (property_condition)?;
condition_set_type: '=' | '^=';
property_condition: property_lhs CONDITION_SEPARATOR property_rhs;
property_lhs: QUOTED_STRING;
property_rhs: entity_rhs | contains_rhs;
contains_rhs: CONTAINS_OP '(' contains_value ')';
contains_value: QUOTED_STRING;
entity_rhs: NOT_OP? MATCH_OP? QUOTED_STRING;
// operators
MATCH_OP: '~';
NOT_OP: '^';
CONTAINS_OP: 'contains';
QUOTED_STRING: QUOTE STRING QUOTE;
STRING: (~['\\])*;
QUOTE: '\'';
CONDITION_SEPARATOR: ':';
This grammar fails to parse both ='foo':'bar' and ='foo':contains('bar'), failing with either mismatched input ':' expecting ':' or mismatched input ':contains(' expecting ':'.
Why aren't these inputs parsing?
Your STRING rule matches everything that isn't a backslash or a single quote. So it overlaps with all of your other lexical rules except QUOTED_STRING. Since the lexer will always pick the rule that produces the longest match, and that's almost always STRING, your lexer will produce a bunch of STRING tokens and never any CONDITION_SEPARATOR tokens.
Since you never use STRING in your parser rules, it doesn't need to be an actual token type. In fact, you never want STRING tokens to be generated at all; you only ever want STRING to be matched as part of a QUOTED_STRING token. Therefore it should be a fragment.
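For intuition, a fragment is a named sub-pattern that other lexer rules reuse but that never produces a token of its own. A rough Python analogy using re (the names are hypothetical, and regexes stand in for lexer rules):
import re
STRING = r"[^'\\]*"             # like: fragment STRING: (~['\\])*;
QUOTED_STRING = rf"'{STRING}'"  # like: QUOTED_STRING: QUOTE STRING QUOTE;
print(bool(re.fullmatch(QUOTED_STRING, "'foo'")))  # True
print(bool(re.fullmatch(QUOTED_STRING, "foo")))    # False: quotes required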
The specification for RDF N-Triples states that string literals must be encoded.
https://www.w3.org/TR/n-triples/#grammar-production-STRING_LITERAL_QUOTE
Does this "encoding" have a name I can look up to use it in my programming language? If not, what does it mean in practice?
The grammar productions that you need are right in the document that you linked to:
[9] STRING_LITERAL_QUOTE ::= '"' ([^#x22#x5C#xA#xD] | ECHAR | UCHAR)* '"'
[141s] BLANK_NODE_LABEL ::= '_:' (PN_CHARS_U | [0-9]) ((PN_CHARS | '.')* PN_CHARS)?
[10] UCHAR ::= '\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
[153s] ECHAR ::= '\' [tbnrf"'\]
This means that a string literal begins and ends with a double quote ("). Inside of the double quotes, you can have:
any character except #x22 (the double quote "), #x5C (the backslash \), #xA (line feed), and #xD (carriage return); these are exactly the characters covered by the escapes below;
a Unicode character represented with a \u followed by four hex digits, or a \U followed by eight hex digits; or
an escape sequence, which is a \ followed by one of t, b, n, r, f, ", ', or \, representing the usual escaped characters.
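In practice this "encoding" has no standard name of its own; it is just these escapes. A minimal Python sketch covering only the four mandatory ECHAR escapes (the function name is made up; UCHAR escapes for other characters are optional):
def escape_ntriples_literal(s: str) -> str:
    # Map exactly the characters the production forbids unescaped.
    echar = {
        "\\": "\\\\",  # #x5C backslash
        '"': '\\"',    # #x22 double quote
        "\n": "\\n",   # #xA line feed
        "\r": "\\r",   # #xD carriage return
    }
    return '"' + "".join(echar.get(ch, ch) for ch in s) + '"'
print(escape_ntriples_literal('This "Literal" needs escaping!'))
# "This \"Literal\" needs escaping!"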
You could use rdflib's Literal.n3(). For example:
# pip install rdflib
>>> from rdflib import Literal
>>> lit = Literal('This "Literal" needs escaping!')
>>> s = lit.n3()
>>> print(s)
"This \"Literal\" needs escaping!"
In addition to Josh's answer: it is almost always a good idea to normalize Unicode data to NFC, e.g. in Java you can use the following routine:
java.text.Normalizer.normalize("rdf literal", java.text.Normalizer.Form.NFC);
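The same normalization is available in Python's standard library; a minimal equivalent sketch:
import unicodedata
normalized = unicodedata.normalize("NFC", "rdf literal")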
For more information see: http://www.macchiato.com/unicode/nfc-faq
What is NFC?
For various reasons, Unicode sometimes has multiple representations of the same character. For example, each of the following sequences (the first two being single-character sequences) represent the same character:
U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE
U+212B ( Å ) ANGSTROM SIGN
U+0041 ( A ) LATIN CAPITAL LETTER A + U+030A ( ̊ ) COMBINING RING ABOVE
These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for composition. For more information on these, see the introduction of UAX #15: Unicode Normalization Forms. A function transforming a string S into the NFC form can be abbreviated as toNFC(S), while one that tests whether S is in NFC is abbreviated as isNFC(S).
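A quick Python sketch of toNFC(S) and isNFC(S) over the three sequences above (unicodedata.is_normalized requires Python 3.8+):
import unicodedata
# The three canonically equivalent sequences quoted above.
forms = ["\u00C5", "\u212B", "A\u030A"]
for s in forms:
    print(unicodedata.normalize("NFC", s),      # toNFC(S): all three yield U+00C5
          unicodedata.is_normalized("NFC", s))  # isNFC(S): True only for the first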
These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for compostion. For more information on these, see the introduction of UAX #15: Unicode Normalization Forms. A function transforming a string S into the NFC form can be abbreviated as toNFC(S), while one that tests whether S is in NFC is abbreviated as isNFC(S).