Antlr4 doesn't correctly recognizes unicode characters

Antlr4 doesn't correctly recognizes unicode characters - antlr4

I've very simple grammar which tries to match 'é' to token E_CODE.
I've tested it using TestRig tool (with -tokens option), but parser can't correctly match it.
My input file was encoded in UTF-8 without BOM and I've used ANTLR version 4.4.
Could somebody else also check this ? I got this output on my console:
line 1:0 token recognition error at: 'Ă'
grammar Unicode;
stat:EOF;
E_CODE: '\u00E9' | 'é';

I tested the grammar:
grammar Unicode;
stat: E_CODE* EOF;
E_CODE: '\u00E9' | 'é';
as follows:
UnicodeLexer lexer = new UnicodeLexer(new ANTLRInputStream("\u00E9é"));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());
and the following got printed to my console:
éé<EOF>
Tested with 4.2 and 4.3 (4.4 isn't in Maven Central yet).
EDIT
Looking at the source I see TestRig takes an optional -encoding param. Have you tried setting it?

This is not an answer but a large comment.
I just hit a snag with Unicode, so I thought I would test this. Turned out I wrongly encoded the input file, but here is the test code, everything is default and working extremely well in ANTLR 4.10.1. Maybe of some use:
grammar LetterNumbers;
text: WORD*;
WS: [ \t\r\n]+ -> skip ; // toss out whitespace
// The letters that return Character.LETTER_NUMBER to Character.getType(ch)
// The list: https://www.compart.com/en/unicode/category/Nl
// Roman Numerals are the best known here
WORD: LETTER_NUMBER+;
LETTER_NUMBER:
[\u16ee-\u16f0]|[\u2160-\u2182]|[\u2185-\u2188]
|'\u3007'
|[\u3021-\u3029]|[\u3038-\u303a]|[\ua6e6-\ua6ef];
And the JUnit5 test that goes with that:
package antlerization.minitest;
import antlrgen.minitest.LetterNumbersBaseListener;
import antlrgen.minitest.LetterNumbersLexer;
import antlrgen.minitest.LetterNumbersParser;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.TerminalNode;
import org.junit.jupiter.api.Test;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;
import java.util.LinkedList;
import java.util.List;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.*;
public class MiniTest {
static class WordCollector extends LetterNumbersBaseListener {
public final List<String> collected = new LinkedList<>();
#Override
public void exitText(LetterNumbersParser.TextContext ctx) {
for (TerminalNode tn : ctx.getTokens(LetterNumbersLexer.WORD)) {
collected.add(tn.getText());
}
}
}
private static ParseTree stringToParseTree(String inString) {
Lexer lexer = new LetterNumbersLexer(CharStreams.fromString(inString));
CommonTokenStream tokens = new CommonTokenStream(lexer);
// "text" is the root of the grammar tree
// this returns a sublcass of ParseTree: LetterNumbersParser.TextContext
return (new LetterNumbersParser(tokens)).text();
}
private static List<String> collectWords(ParseTree parseTree) {
WordCollector wc = new WordCollector();
(new ParseTreeWalker()).walk(wc, parseTree);
return wc.collected;
}
private static String joinForTest(List<String> list) {
return String.join(",",list);
}
private static String stringInToStringOut(String parseThis) {
return joinForTest(collectWords(stringToParseTree(parseThis)));
}
#Test
void unicodeCharsOneWord() {
String res = stringInToStringOut("ⅣⅢⅤⅢ");
assertThat(res,equalTo("ⅣⅢⅤⅢ"));
}
#Test
void escapesOneWord() {
String res = stringInToStringOut("\u2163\u2162\u2164\u2162");
assertThat(res,equalTo("ⅣⅢⅤⅢ"));
}
#Test
void unicodeCharsMultipleWords() {
String res = stringInToStringOut("ⅠⅡⅢ ⅣⅤⅥ ⅦⅧⅨ ⅩⅪⅫ ⅬⅭⅮⅯ");
assertThat(res,equalTo("ⅠⅡⅢ,ⅣⅤⅥ,ⅦⅧⅨ,ⅩⅪⅫ,ⅬⅭⅮⅯ"));
}
#Test
void unicodeCharsLetters() {
String res = stringInToStringOut("Ⅰ Ⅱ Ⅲ \n Ⅳ Ⅴ Ⅵ \n Ⅶ Ⅷ Ⅸ \n Ⅹ Ⅺ Ⅻ \n Ⅼ Ⅽ Ⅾ Ⅿ");
assertThat(res,equalTo("Ⅰ,Ⅱ,Ⅲ,Ⅳ,Ⅴ,Ⅵ,Ⅶ,Ⅷ,Ⅸ,Ⅹ,Ⅺ,Ⅻ,Ⅼ,Ⅽ,Ⅾ,Ⅿ"));
}
}

Your grammar file is not saved in utf8 format.
Utf8 is default format that antlr accept as input grammar file, according with terence Parr book.

Related

Scanners method "nextLine" doesn`t allow to input line

import java.io.IOException;
import java.util.Scanner;
public class MainTraining {
public static void main(String[] args) throws IOException {
Scanner scanner = new Scanner(System.in);
int countOfStrangers = scanner.nextInt();
String[] name = new String[countOfStrangers];
System.out.println(scanner.nextLine());
}
}
It will output nothing.
Why?
that is only first part of my programm, so that separately this code has no sense

As i undersood, there are some troubles if you use nextLine after nextInt. To solve this problem you can read int with nextLine and after that parse it to int. Also you can create two scanners: one for ints and one for Strings
Full answer is here: Why can't I enter a string in Scanner(System.in), when calling nextLine()-method?

Groovy: Proper way to create/use this class

I am loading a groovy script and I need the following class:
import java.util.function.Function
import java.util.function.Supplier
class MaskingPrintStream extends PrintStream {
private final Function<String,String> subFunction
private final Supplier<String> secretText
public MaskingPrintStream(PrintStream out, Supplier<String> secretText, Function<String,String> subFunction) {
super(out);
this.subFunction = subFunction;
this.secretText = secretText;
}
#Override
public void write(byte b[], int off, int len) {
String out = new String(b,off,len);
String secret = secretText.get();
byte[] dump = out.replace(secret, subFunction.apply(secret)).getBytes();
super.write(dump,0,dump.length);
}
}
right now I have it in a file called MaskingPrintStream.groovy. But, by doing this I effectively can only access this class as an inner class of the class that gets created by default that corresponds to the file name.
What I want to work is code more like this:
def stream = evaluate(new File(ClassLoader.getSystemResource('MaskingPrintStream.groovy').file))
But as you can see, I need to give it some values before it's ready. Perhaps I could load the class into the JVM (not sure how from another groovy script) and then instantiate it the old-fashioned way?
The other problem is: How do I set it up so I don't have this nested class arrangement?

groovy.lang.Script#evaluate(java.lang.String) has a bit different purpose.
For your case you need to use groovy.lang.GroovyClassLoader that is able to parse groovy classes from source, compile and load them. Here's example for your code:
def groovyClassLoader = new GroovyClassLoader(this.class.classLoader) // keep for loading other classes as well
groovyClassLoader.parseClass('MaskingPrintStream.groovy')
def buf = new ByteArrayOutputStream()
def maskingStream = new MaskingPrintStream(new PrintStream(buf), { 'secret' }, { 'XXXXX' })
maskingStream.with {
append 'some text '
append 'secret '
append 'super-duper secret '
append 'other text'
}
maskingStream.close()
println "buf = ${buf}"
And the output it produces in bash shell:
> ./my-script.groovy
buf = some text XXXXX super-duper XXXXX other text

eliminating embedded actions from antlr4 grammar

I have an antlr grammar in which embedded actions are used to collect data bottom up and build aggregated data structures. A short version is given below, where the aggregated data structures are only printed (ie no classes are created for them in this short sample code).
grammar Sample;
top returns [ArrayList l]
#init { $l = new ArrayList<String>(); }
: (mid { $l.add($mid.s); } )* ;
mid returns [String s]
: i1=identifier 'hello' i2=identifier
{ $s = $i1.s + " bye " + $i2.s; }
;
identifier returns [String s]
: ID { $s = $ID.getText(); } ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
Its corresponding Main program is:
public class Main {
public static void main( String[] args) throws Exception
{
SampleLexer lexer = new SampleLexer( new ANTLRFileStream(args[0]));
CommonTokenStream tokens = new CommonTokenStream( lexer );
SampleParser parser = new SampleParser( tokens );
ArrayList<String> top = parser.top().l;
System.out.println(top);
}
}
And a sample test is:
aaa hello bbb
xyz hello pqr
Since one of the objectives of antlr is to keep the grammar file reusable and action-independent, I am trying to delete the actions from this file and move it to a tree walker. I took a first stab at it with the following code:
public class Main {
public static void main( String[] args) throws Exception
{
SampleLexer lexer = new SampleLexer( new ANTLRFileStream(args[0]));
CommonTokenStream tokens = new CommonTokenStream( lexer );
SampleParser parser = new SampleParser( tokens );
ParseTree tree = parser.top();
ParseTreeWalker walker = new ParseTreeWalker();
walker.walk( new Walker(), tree );
}
}
public class Walker extends SampleBaseListener {
public void exitTop(SampleParser.TopContext ctx ) {
System.out.println( "Exit Top : " + ctx.mid() );
}
public String exitMid(SampleParser.MidContext ctx ) {
return ctx.identifier() + " bye "; // ignoring the 2nd instance here
}
public String exitIdentifier(SampleParser.IdentifierContext ctx ) {
return ctx.ID().getText() ;
}
}
But obviously this is wrong, because at the least, the return types of the Walker methods should be void, so they dont have a way to return aggregated values upstream. Secondly, I dont see a way how to access the "i1" and "i2" from the walker code, so I am not able to differentiate between the two instances of "identifier" in that rule.
Any suggestions on how to separate the actions from the grammar for this purpose?
Should I use a visitor instead of a listener here, since the visitor has the capability of returning values? If I use a visitor, how do I solve the problem of differentiating between "i1" and "i2" (as mentioned above)?
Does a visitor perform its action only at the exit of a rule (unlike the listeners, which exist for both entry and exit)? For example, if I have to initialize the list at the entry of rule "top", how can I do it with a visitor, which executes only at the conclusion of a rule? Do I need a enterTop listener for that purpose?
EDIT: After the initial post, I have modified the rule "top" to create and return a list, and pass this list back to the main program for printing. This is to illustrate why I need an initialization mechanism for the code.

Based on what you are trying to do I think you may benefit from using ANTLR's BaseVisitor Class rather than the BaseListener Class.
Assuming your grammar is this (I generalized it and I'll explain the changes below):
grammar Sample;
top : mid* ;
mid : i1=identifier 'hello' i2=identifier ;
identifier : ID ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
Then your Walker would look like this:
public class Walker extends SampleBaseVisitor<Object> {
public ArrayList<String> visitTop(SampleParser.TopContext ctx) {
ArrayList<String> arrayList = new ArrayList<>();
for (SampleParser.MidContext midCtx : ctx.mid()) {
arrayList.add(visitMid(midCtx));
}
return arrayList;
}
public String visitMid(SampleParser.MidContext ctx) {
return visitIdentifier(ctx.i1) + " bye " + visitIdentifier(ctx.i2);
}
public String visitIdentifier(SampleParser.IdentifierContext ctx) {
return ctx.getText();
}
}
This allows you to visit and get the result of any rule you want.
You are able to access i1 and i2, as you labeled them through the visitor methods. Note that you don't really need the identifier rule since it contains only one token and you can access a token's text directly in the visitMid, but really it's personal preference.
You should also note that SampleBaseVisitor is a generic class, where the generic parameter determines the return type of the visit methods. For your example I set the generic parameter Object, but you could even make your own class which contains the information you want to preserve and use that for your generic parameter.
Here are some more useful methods which BaseVisitor inherits which may help you out.
Lastly, your main method would end up looking something like this:
public static void main( String[] args) throws IOException {
FileInputStream fileInputStream = new FileInputStream(args[0]);
SampleLexer lexer = new SampleLexer(CharStreams.fromStream(fileInputStream));
CommonTokenStream tokens = new CommonTokenStream(lexer);
SampleParser parser = new SampleParser(tokens);
for (String string : new Walker().visitTop(parser.top())) {
System.out.println(string);
}
}
As a side note, the ANTLRFileStream class is deprecated in ANTLR4.
It is recommend to use CharStreams instead.

As Terence Parr points out in the Definitive Reference, one main difference between Visitor and Listener is that the Visitor can return values. And that can be convenient. But Listener has a place too! What I do for listener is exemplified in this answer. Granted, there are simpler ways of parsing a list of numbers, but I made that answer to show a complete and working example of how to aggregate return values from a listener into a public data structure that can be consumed later.
public class ValuesListener : ValuesBaseListener
{
public List<double> doubles = new List<double>(); // <<=== SEE HERE
public override void ExitNumber(ValuesParser.NumberContext context)
{
doubles.Add(Convert.ToDouble(context.GetChild(0).GetText()));
}
}
Looking closely at the Listener class, I include a public data collection -- a List<double> in this case -- to collect values parsed or calculated in the listener events. You can use any data structure you like: another custom class, a list, a queue, a stack (great for calculations and expression evaluation), whatever you like.
So while the Visitor is arguably more flexible, the Listener is a strong contender too, depending on how you want to aggregate your results.

How to convert String tokens into CoreLabel instances (StanfordNLP)?

Can we convert a String token into a CoreLabel instance?
So far, I am using:
CoreLabelTokenFactory c = new CoreLabelTokenFactory();
CoreLabel tokens = c.makeToken("going",0,"going".length());
The string gets converted, however with this approach, CoreLabel is not working in finding lemma's and pos.

Here is some sample code to demonstrate going from a raw String to an Annotation object:
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.*;
public class TokenizeExample {
public static void main (String[] args) throws IOException {
String text = "Here is a sentence. Here is another sentence.";
Annotation document = new Annotation(text);
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
for (CoreLabel cl : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
System.out.println("---");
System.out.println(cl);
System.out.println(cl.get(CoreAnnotations.PartOfSpeechAnnotation.class));
System.out.println(cl.get(CoreAnnotations.LemmaAnnotation.class));
}
}
}
}
Make sure you get Stanford CoreNLP 3.5.2 from here: http://nlp.stanford.edu/software/corenlp.shtml

Special character escaping

All groovy special character #{\'}${"}/', needs to be replaced by \ in front in a groovy string dynamically
input : anish$spe{cial
output : anish\$spe\{cial
input : anish}stack{overflow'
output : anish\}stack\{overflow\'
I have written a sample program in Java, that i want in groovier way
import java.util.regex.*;
import java.io.*;
/**
*
* #author anish
*
*/
public class EscapeSpecialChar {
public static void main(String[] args) throws IOException {
inputString();
}
private static void inputString() throws IOException {
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
System.out.print("Enter string to find special characters: ");
String string = in.readLine();
// Escape the pattern
string = escapeRE(string);
System.out.println("output: -- " + string);
}
// Returns a pattern where all punctuation characters are escaped.
static Pattern escaper = Pattern.compile("([^a-zA-z0-9])");
public static String escapeRE(String str) {
return escaper.matcher(str).replaceAll("\\\\$1");
}
}
Enter string to find special characters: $Anish(Stack%1231+#$124{}
output: -- \$Anish\(Stack\%1231\+\#\$124\{\}

This does what your Java code does:
System.console().with {
def inStr = readLine 'Enter string to find special characters: '
def outStr = inStr.replaceAll( /([^a-zA-Z0-9])/, '\\\\$1' )
println "Output: $outStr"
}
I am still dubious that what I think you are doing is a good idea though... ;-)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Antlr4 doesn't correctly recognizes unicode characters - antlr4

Your grammar file is not saved in utf8 format. Utf8 is default format that antlr accept as input grammar file, according with terence Parr book.

Related

Scanners method "nextLine" doesn`t allow to input line

Groovy: Proper way to create/use this class

eliminating embedded actions from antlr4 grammar

How to convert String tokens into CoreLabel instances (StanfordNLP)?

Special character escaping

Categories

Resources