Some Requests for Comments (RFCs) have these two rules:
1. Each comma character must be escaped with a backslash.
2. A backslash that is not being used to escape a comma must be escaped with a backslash.
Here are valid values:
A\,B
A\\\,B
A\\\\\\\,B
Here are invalid values:
A\\,B
A,B
A\\\B
I created two Alloy models.
My second Alloy model has two signatures for backslashes:
sig Backslash extends Char {}
sig EscapedBackslash extends Char {}
The former represents a single backslash; the latter, of course, represents an escaped (double) backslash.
Given those two signatures, it was really easy to express the rules:
Each comma must be preceded by a backslash.
Each backslash must be followed by a comma.
I didn't need to worry about making sure certain backslashes were appropriately escaped. The EscapedBackslash signature already took care of that.
In my first model I had only one signature for backslash:
sig Backslash extends Char {}
It was very hard to implement the two rules; in fact, I never succeeded.
As I stated earlier, the second model has this signature:
sig EscapedBackslash extends Char {}
which enabled me to completely avoid the task of checking that every backslash that is not being used to escape a comma is escaped. Is that cheating?
Here is my Alloy model (second model):
one sig Text {
firstChar: Char
}
abstract sig Char {
next: lone Char,
prev: lone Char
}
sig A extends Char {}
sig B extends Char {}
sig C extends Char {}
sig Comma extends Char {}
sig Backslash extends Char {}
sig EscapedBackslash extends Char {}
fact Text_Structure {
// If the i'th character is c and c.next is c', then c'.prev equals c
all c: Char | some c.next => c.next.prev = c
// The first character has no previous character
no Text.firstChar.prev
// No cycles in the forward direction, i.e., if the i'th character is c, then the
// i'th + n character cannot be c
no c: Char | c in c.^next
// No cycles in the backward direction, i.e., if the i'th character is c, then the
// i'th - n character cannot be c
no c: Char | c in c.^prev
// This is not necessary. I just don't like to see miscellaneous relations.
// This says that if a character is not in the string, then it has no next or previous character.
all c: Char | c not in Text.firstChar.*next => no c.next and no c.prev
}
// Rule: Every comma MUST be escaped.
fact Every_Comma_Escaped {
// The firstChar cannot be a comma
Text.firstChar not in Comma
all c: Text.firstChar.*next | c in Comma => c.prev in Backslash
}
// Rule: If a backslash is not being used to escape a character,
// then the backslash MUST be escaped, i.e., the single backslash
// must occur only when preceding a comma.
fact Every_Literal_Backslash_Escaped {
all c: Text.firstChar.*next | (c in Backslash) => (some c.next) and (c.next in Text.firstChar.*next) and (c.next in Comma)
}
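To see the model in action, one could append a run command like the following (my addition for illustration; the scope is arbitrary):
run { some Text.firstChar.next } for 6
The Analyzer will then generate character sequences that satisfy all three facts.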
I need to implement a parser for this type of logic, following the specified grammar.
S is the start symbol of the grammar; L, T, R, V, K, D, F, and E denote nonterminal symbols. The terminal symbol c corresponds to one of the two scalar types specified in the task, and the terminal symbol t corresponds to one of the data types that can be described in the type section.
I created the following grammar:
grammar Parse;
compileString: S+;
S: TYPE L VAR R;
L: T (SEPARATOR|SEPARATOR L);
R: V (SEPARATOR|SEPARATOR R);
V: [a-zA-Z] ([a-zA-Z]| [0-9]|'_')* DEFINITION (D|C);
T: D|C;
TYPE:'type';
VAR:'var';
D: // acceptable data types
'struct'
| 'union'
| 'array'
;
C: 'byte'
|'word' //scalar type
;
SEPARATOR:';';
DEFINITION :':';
WS : [ \t\n\r]+ -> skip ; // whitespaces
But when I try to run it on the input "type byte; var p1:word;", I get the following output:
Tokens:
[#0,0:3='type',<6>,1:0]
[#1,5:9='byte;',<2>,1:5]
[#2,11:13='var',<7>,1:11]
[#3,15:22='p1:word;',<3>,1:15]
[#4,23:22='<EOF>',<-1>,1:23]
Parse Tree:
compileString (
<Error>"type"
<Error>"byte;"
<Error>"var"
<Error>"p1:word;"
)
I do not understand what the problem may be; debugging was done in VS Code with the ANTLR plugin. I would be glad for any answer!
In ANTLR, lexer rules start with capital letters and parser rules with lower-case letters, so all of your rules except compileString are lexer rules.
S: TYPE L VAR R; does not match the input type byte; var p1:word; because there are spaces in it and nothing in the definition of S matches spaces. You're probably thinking that shouldn't matter because you're skipping spaces, but tokens are only skipped between lexer rules, not inside of them. So it would work if S were a parser rule, but not as a lexer rule.
The same applies to spaces between the separator and L/R in L and R.
PS: I strongly suggest giving your rules longer names, as it is quite hard to follow your grammar. You might also consider using the + operator in L and R instead of recursion.
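For illustration, here is an untested sketch of what the grammar could look like with S, L, R, V, and T as parser rules; the longer rule and token names below are my own invention:
grammar Parse;

compileString : section+ EOF ;
section  : TYPE typeList VAR varList ;   // was S
typeList : (typeName SEPARATOR)+ ;       // was L, using + instead of recursion
varList  : (varDecl SEPARATOR)+ ;        // was R
varDecl  : IDENT DEFINITION typeName ;   // was V
typeName : DATA_TYPE | SCALAR_TYPE ;     // was T

TYPE        : 'type' ;
VAR         : 'var' ;
DATA_TYPE   : 'struct' | 'union' | 'array' ; // was D
SCALAR_TYPE : 'byte' | 'word' ;              // was C
SEPARATOR   : ';' ;
DEFINITION  : ':' ;
IDENT       : [a-zA-Z] ([a-zA-Z] | [0-9] | '_')* ; // after the keywords, so they win
WS          : [ \t\n\r]+ -> skip ;
Because WS tokens are skipped between tokens, the spaces in type byte; var p1:word; no longer matter.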
For a string like "{foo}{bar}", is there an easy pattern such that
str = "{foo}{bar}"
first, second = str:gmatch(...)...
gives first="foo" and second="bar"?
The problem is that foo itself can contain more braces, e.g.:
str = "{foo {baz}{bar}"
so that first = "foo {baz". The bar part has only alphanumeric characters, no braces.
You may use
first, second = str:match('{([^}]*)}%s*{([^}]*)}')
The str:match method will find and return the first match, and since there are two capturing groups, two values are returned upon a valid match.
The pattern means:
{ - a { char
([^}]*) - Group 1: any 0+ chars other than }
} - a } char
%s* - 0+ whitespaces (not necessary, but a bonus)
{([^}]*)} - same as above, just there is a Group 2 defined here.
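As a quick check (my own snippet), the pattern returns the expected captures for both inputs from the question:
local pattern = '{([^}]*)}%s*{([^}]*)}'
print(("{foo}{bar}"):match(pattern))       --> foo      bar
print(("{foo {baz}{bar}"):match(pattern))  --> foo {baz bar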
I am new to the concept of lexing and am trying to write a lexer in ocaml to read the following example input:
(blue, 4, dog, 15)
Basically the input is a list of any random string or integer. I have found many examples for int based inputs as most of them model a calculator, but have not found any guidance through examples or the documentation for lexing strings. Here is what I have so far as my lexer:
(* File lexer.mll *)
{
open Parser
}
rule lexer_main = parse
[' ' '\r' '\t'] { lexer_main lexbuf } (* skip blanks *)
| ['0'-'9']+ as lxm { INT(int_of_string lxm) }
| '(' { LPAREN }
| ')' { RPAREN }
| ',' { COMMA }
| eof { EOF }
| _ { syntax_error "couldn't identify the token" }
As you can see I am missing the ability to parse strings. I am aware that a string can be represented in the form ['a'-'z'] so would it be as simple as ['a'-'z'] { STRING }
Thanks for your help.
The notation ['a'-'z'] represents a single character, not a string. So a string is more or less a sequence of one or more of those. I have a fear that this is an assignment, so I'll just say that you can extend a pattern for a single character into a pattern for a sequence of the same kind of character using the same technique you're using for INT.
However, I wonder whether you really want your strings to be so restrictive. Are they really required to consist of alphabetic characters only?
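For concreteness (and since this may be homework, treat it only as a sketch): such a rule could read as below, where the STRING constructor is hypothetical and would have to be declared as a token (e.g. %token <string> STRING in parser.mly):
| ['a'-'z']+ as lxm { STRING(lxm) }  (* one or more letters, same technique as INT *)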
I am trying to write a grammar that will match the finite closure pattern for regular expressions (i.e. foo{1,3} matches 1 to 3 occurrences of 'o' after the 'fo' prefix).
To be identified as a finite closure, the string {x,y} must not include spaces; for example, { 1, 3} is recognized as a sequence of seven ordinary characters.
I have written the following lexer and parser files, but I am not sure if this is the best solution. I am using a lexical mode for the closure pattern, which is activated when a regular expression matches a valid closure expression.
lexer grammar closure_lexer;
@header { using System;
using System.IO; }
@lexer::members {
public static bool guard = true;
public static int LBindex = 0;
}
OTHER : .;
NL : '\r'? '\n' ;
CLOSURE_FLAG : {guard}? {LBindex =InputStream.Index; }
'{' INTEGER ( ',' INTEGER? )? '}'
{ closure_lexer.guard = false;
// Go back to the opening brace
InputStream.Seek(LBindex);
Console.WriteLine("Enter Closure Mode");
Mode(CLOSURE);
} -> skip
;
mode CLOSURE;
LB : '{';
RB : '}' { closure_lexer.guard = true;
Mode(0); Console.WriteLine("Enter Default Mode"); };
COMMA : ',' ;
NUMBER : INTEGER ;
fragment INTEGER : [1-9][0-9]*;
and the parser grammar
parser grammar closure_parser;
@header { using System;
using System.IO; }
options { tokenVocab = closure_lexer; }
compileUnit
: ( other {Console.WriteLine("OTHER: {0}",$other.text);} |
closure {Console.WriteLine("CLOSURE: {0}",$closure.text);} )+
;
other : ( OTHER | NL )+;
closure : LB NUMBER (COMMA NUMBER?)? RB;
Is there a better way to handle this situation?
Thanks in advance
This looks quite complex for such a simple task. You can easily let your lexer match one construct (preferably the one without whitespace, if you usually skip whitespace) and let the parser match the other form. You don't even need lexer modes for that.
Define your closure rule:
CLOSURE
: OPEN_CURLY INTEGER (COMMA INTEGER?)? CLOSE_CURLY
;
This rule will not match any form that contains e.g. whitespaces. So, if your lexer does not match CLOSURE you will get all the individual tokens like the curly braces and integers ending up in your parser for matching (where you then can treat them as something different).
NB: doesn't the closure definition also allow {,n} (i.e. {0,n})? That requires an additional alt in the CLOSURE rule, as sketched below.
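Such an alt could look like this (a sketch, with OPEN_CURLY/CLOSE_CURLY as in the rule above and COMMA/INTEGER as in the question's lexer):
CLOSURE
: OPEN_CURLY INTEGER (COMMA INTEGER?)? CLOSE_CURLY // {n}, {n,}, {n,m}
| OPEN_CURLY COMMA INTEGER CLOSE_CURLY             // {,n}
;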
And finally a hint: your OTHER rule will probably give you trouble, as it matches any char and is even located before other rules. If you have a wildcard rule, it should be the last one in your grammar, matching everything not matched by any other rule.
I saw the operator r#"" in Rust but I can't find what it does. It came in handy for creating JSON:
let var1 = "test1";
let json = r#"{"type": "type1", "type2": var1}"#;
println!("{}", json) // => {"type2": "type1", "type2": var1}
What's the name of the operator r#""? How do I make var1 evaluate?
I can't find what it does
It has to do with string literals and raw strings. I think it is explained pretty well in this part of the documentation; in the code block posted there you can see what it does:
"foo"; r"foo"; // foo
"\"foo\""; r#""foo""#; // "foo"
"foo #\"# bar";
r##"foo #"# bar"##; // foo #"# bar
"\x52"; "R"; r"R"; // R
"\\x52"; r"\x52"; // \x52
It negates the need to escape special characters inside the string.
The r character at the start of a string literal denotes a raw string literal. It's not an operator, but rather a prefix.
In a normal string literal, there are some characters that you need to escape to make them part of the string, such as " and \. The " character needs to be escaped because it would otherwise terminate the string, and the \ needs to be escaped because it is the escape character.
In raw string literals, you can put an arbitrary number of # symbols between the r and the opening ". To close the raw string literal, you must have a closing ", followed by the same number of # characters as there are at the start. With zero or more # characters, you can put literal \ characters in the string (\ characters do not have any special meaning). With one or more # characters, you can put literal " characters in the string. If you need a " followed by a sequence of # characters in the string, just use the same number of # characters plus one to delimit the string. For example: r##"foo #"# bar"## represents the string foo #"# bar. The literal doesn't stop at the quote in the middle, because it's only followed by one #, whereas the literal was started with two #.
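A few asserts of my own that exercise the counting rule (compile and run to verify):
fn main() {
    // zero '#': backslashes are literal, but a '"' cannot appear
    assert_eq!(r"\x52", "\\x52");
    // one '#': a literal '"' is fine
    assert_eq!(r#""foo""#, "\"foo\"");
    // two '#': the literal may even contain the sequence "#
    assert_eq!(r##"foo #"# bar"##, "foo #\"# bar");
}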
To answer the last part of your question, there's no way to have a string literal that evaluates variables in the current scope. Some languages, such as PHP, support that, but not Rust. You should consider using the format! macro instead. Note that for JSON, you'll still need to double the braces, even in a raw string literal, because the string is interpreted by the macro.
fn main() {
let var1 = "test1";
let json = format!(r#"{{"type": "type1", "type2": {}}}"#, var1);
println!("{}", json) // => {"type2": "type1", "type2": test1}
}
If you need to generate a lot of JSON, there are many crates that will make it easier for you. In particular, with serde_json, you can define regular Rust structs or enums and have them serialized automatically to JSON.
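For example, a minimal sketch with serde_json's json! macro (field names taken from the question):
use serde_json::json;

fn main() {
    let var1 = "test1";
    let json = json!({ "type": "type1", "type2": var1 });
    println!("{}", json); // => {"type":"type1","type2":"test1"}
}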
The first time I saw this weird notation was in the glium tutorials (an old crate for graphics management), where it is used to "encapsulate" GLSL (OpenGL Shading Language) code and pass it to the GPU's shaders:
https://github.com/glium/glium/blob/master/book/tuto-02-triangle.md
As far as I understand, the content of r#"..."# is left untouched; it is not interpreted in any way. Hence "raw" string.