ANTLR4: Lexer.getCharIndex() return value not behaving as expected - antlr4

I want to extract a specific fragment of a lexer rule, so I wrote the following grammars:
parser grammar TestParser;
options { tokenVocab=TestLexer; }
root
    : LINE+ EOF
    ;
lexer grammar TestLexer;
@lexer::members {
    private int startIndex = 0;

    private void updateStartIndex() {
        startIndex = getCharIndex();
    }

    private void printNumber() {
        String number = _input.getText(Interval.of(startIndex, getCharIndex() - 1));
        System.out.println(number);
    }
}
LINE: {getCharPositionInLine() == 0}? ANSWER SPACE {updateStartIndex();} NUMBER {printNumber();} .+? NEWLINE;
OTHER: . -> skip;
fragment NUMBER: [0-9]+;
fragment ANSWER: '( ' [A-D] ' )';
fragment SPACE: ' ';
fragment NEWLINE: '\n';
fragment DOT: '.';
Execute the following code:
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.ParseTree;
public class TestParseTest {
    public static void main(String[] args) {
        CharStream charStream = CharStreams.fromString("( A ) 1. haha\n" +
                "( B ) 12. hahaha\n");
        Lexer lexer = new TestLexer(charStream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree parseTree = parser.root();
        System.out.println(parseTree.toStringTree(parser));
    }
}
The output is as follows:
1
12
(root ( A ) 1. haha\n ( B ) 12. hahaha\n <EOF>)
At this point, the value of the fragment NUMBER is printed as expected. Then I add the fragment DOT to the lexer rule LINE:
LINE: {getCharPositionInLine() == 0}? ANSWER SPACE {updateStartIndex();} NUMBER {printNumber();} DOT .+? NEWLINE;
The output of the above test code is as follows:
1
1
(root ( A ) 1. haha\n ( B ) 12. hahaha\n <EOF>)
Why does the second line of output change to 1? This is what I don't understand.
If we modify the test code as follows:
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.ParseTree;
public class TestParseTest {
    public static void main(String[] args) {
        CharStream charStream = CharStreams.fromString("( B ) 12. hahaha\n" +
                "( B ) 123. hahaha\n");
        Lexer lexer = new TestLexer(charStream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree parseTree = parser.root();
        System.out.println(parseTree.toStringTree(parser));
    }
}
With this input, when LINE does not contain DOT, the output is as follows:
12
123
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)
When LINE contains DOT, the output is as follows:
12
12
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)
Update
I have submitted this issue to GitHub: Lexer.getCharIndex() return value not behaving as expected · Issue #3606 · antlr/antlr4 · GitHub
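In the meantime, a workaround that avoids tracking char indices across mid-rule actions is to do the extraction in a single end-of-rule action, where getText() returns the complete text of the token just matched. This is only a sketch of mine, not a fix for the underlying behavior; the NUM pattern and the rewritten printNumber() are my own illustration and assume the number always follows the ") " matched by ANSWER:

@lexer::members {
    private static final java.util.regex.Pattern NUM =
            java.util.regex.Pattern.compile("\\) ([0-9]+)");

    private void printNumber() {
        // at the end of the rule the token is fully matched,
        // so getText() returns the whole LINE text
        java.util.regex.Matcher m = NUM.matcher(getText());
        if (m.find()) System.out.println(m.group(1));
    }
}

LINE: {getCharPositionInLine() == 0}? ANSWER SPACE NUMBER DOT .+? NEWLINE {printNumber();};

With this variant the startIndex bookkeeping can be dropped entirely.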

Related

Apex: parse CSV that contains double quotes in every single record

public static List<List<String>> parseCSV(String contents, Boolean skipHeaders) {
    List<List<String>> allFields = new List<List<String>>();
    // replace instances where a double quote begins a field containing a comma
    // in this case you get a double quote followed by a doubled double quote
    // do this for beginning and end of a field
    contents = contents.replaceAll(',"""', ',"DBLQT').replaceAll('""",', 'DBLQT",');
    // now replace all remaining double quotes - we do this so that we can reconstruct
    // fields with commas inside assuming they begin and end with a double quote
    contents = contents.replaceAll('""', 'DBLQT');
    // we are not attempting to handle fields with a newline inside of them
    // so, split on newline to get the spreadsheet rows
    List<String> lines = new List<String>();
    try {
        lines = contents.split('\n');
    } catch (System.ListException e) {
        System.debug('Limits exceeded?' + e.getMessage());
    }
    Integer num = 0;
    for (String line : lines) {
        // check for blank CSV lines (only commas)
        if (line.replaceAll(',', '').trim().length() == 0) break;
        List<String> fields = line.split(',');
        List<String> cleanFields = new List<String>();
        String compositeField;
        Boolean makeCompositeField = false;
        for (String field : fields) {
            if (field.startsWith('"') && field.endsWith('"')) {
                cleanFields.add(field.replaceAll('DBLQT', '"'));
            } else if (field.startsWith('"')) {
                makeCompositeField = true;
                compositeField = field;
            } else if (field.endsWith('"')) {
                compositeField += ',' + field;
                cleanFields.add(compositeField.replaceAll('DBLQT', '"'));
                makeCompositeField = false;
            } else if (makeCompositeField) {
                compositeField += ',' + field;
            } else {
                cleanFields.add(field.replaceAll('DBLQT', '"'));
            }
        }
        allFields.add(cleanFields);
    }
    if (skipHeaders) allFields.remove(0);
    return allFields;
}
I use this code to parse a CSV file, but I found that I can't parse it when the fields are all bounded by double quotes.
For example, I have records like this:
"a","b","c","d,e,f","g"
After parsing, I would like to get:
a b c d,e,f g
From what I've seen, the first thing you do is split the line you get from the CSV file by commas, using this line:
List<String> fields = line.split(',');
When you do this to your own example ("a","b","c","d,e,f","g"), what you get as your list of string is:
fields = ("a" | "b" | "c" | "d | e | f" | "g"), where the bar is used to separate the list elements
The issue here is that, if you first split by commas, it will be a little more difficult to differentiate the commas that are part of a field (because they actually appeared inside quotes) from those that separate fields in your CSV.
I suggest trying to split the line by quotes instead, which would give you something like this:
fields = (a | , | b | , | c | , | d, e, f | , | g)
and filtering out any elements of your list that are only commas and/or spaces, finally achieving this:
fields = (a | b | c | d, e, f | g)
Is that Java you're using? Anyway, here is Java code that does what you're trying to do:
import java.util.*;

public class HelloWorld {
    public static ArrayList<ArrayList<String>> parseCSV(String contents, Boolean skipHeaders) {
        ArrayList<ArrayList<String>> allFields = new ArrayList<ArrayList<String>>();
        // separating the file in lines; wrap in a new ArrayList because
        // Arrays.asList returns a fixed-size list that cannot be modified
        List<String> lines = new ArrayList<String>(Arrays.asList(contents.split("\n")));
        // ignoring header, if needed
        if (skipHeaders) lines.remove(0);
        // for each line
        for (String line : lines) {
            List<String> fields = Arrays.asList(line.split("\""));
            ArrayList<String> cleanFields = new ArrayList<String>();
            Boolean isComma = false;
            for (String field : fields) {
                // ignore elements that don't have useful data
                // (every other element after splitting by quotes)
                isComma = !isComma;
                if (isComma) continue;
                cleanFields.add(field);
            }
            allFields.add(cleanFields);
        }
        return allFields;
    }

    public static void main(String[] args) {
        // example of input file:
        // Line 1: "a","b","c","d,e,f","g"
        // Line 2: "a1","b1","c1","d1,e1,f1","g1"
        ArrayList<ArrayList<String>> strings = HelloWorld.parseCSV(
                "\"a\",\"b\",\"c\",\"d,e,f\",\"g\"\n\"a1\",\"b1\",\"c1\",\"d1,e1,f1\",\"g1\"", false);
        System.out.println("Result:");
        for (ArrayList<String> list : strings) {
            System.out.println(" New List:");
            for (String str : list) {
                System.out.println(" - " + str);
            }
        }
    }
}

Groovy String to CSV

I have a string input of the structure like:
[{name=John, dob=1970-07-27 00:00:00.0, score=81},
{name=Jane, dob=1970-07-28 00:00:00.0, score=77}]
I am trying to convert it to a CSV output. So expected output:
"name", "dob", "score"
"John", "1970-07-27 00:00:00.0", 81
"Jane", "1970-07-28 00:00:00.0", 77
So far I have tried something like
def js = "[{name=John, dob=1970-07-27 00:00:00.0, score=81}, {name=Jane, dob=1970-07-28 00:00:00.0, score=77}]"
def js = new JsonSlurper().parseText(s)
def columns = js*.keySet().flatten().unique()
def encode = { e -> e == null ? '' : e instanceof String ? /"$e"/ : "$e" }
// Print all the column names
println columns.collect { c -> encode( c ) }.join( ',' )
// Then create all the rows
println data.infile.collect { row ->
// A row at a time
columns.collect { colName -> encode( row[ colName ] ) }.join( ',' )
}.join( '\n' )
This fails with groovy.json.JsonException: expecting '}' or ',' but got current char 'n'
Defining var js without quotes fails with expecting '}', found ','
Thanks in advance
I wrote this simple parser, which splits the input into lines. If your input changes (i.e., a comma or = is used in any other place), you will need a more complex parser:
input = '''[{name=John, dob=1970-07-27 00:00:00.0, score=81},
{name=Jane, dob=1970-07-28 00:00:00.0, score=77},
{name=Test, dob=1980-01-01 00:00:00.0, score=90}]'''
maps = (input =~ /\{(.*)\}/)
    .collect {
        it[1]
            .split(', ')
            .collectEntries { entry ->
                entry.split('=').with {
                    [it.first(), it.last()]
                }
            }
    }

assert maps == [
    [name: 'John', dob: '1970-07-27 00:00:00.0', score: '81'],
    [name: 'Jane', dob: '1970-07-28 00:00:00.0', score: '77'],
    [name: 'Test', dob: '1980-01-01 00:00:00.0', score: '90']
]
You can try this code snippet
import groovy.json.JsonSlurper
def s = '[{"name":"John", "dob":"1970-07-27 00:00:00.0", "score":81},{"name":"Jane", "dob":"1970-07-28 00:00:00.0", "score":77}]'
def js = new JsonSlurper().parseText(s)
def columns = js*.keySet().flatten().unique()
def encode = { e -> e == null ? '' : e instanceof String ? /"$e"/ : "$e" }
println columns.collect { c -> encode(c) }.join(',')
js.each {
    println it.values().join(",")
}
I used Groovy's JsonSlurper to convert the String to a Map. For this you need to provide a proper JSON String.
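If the input really arrives in the name=value form from the question rather than as JSON, one option is to rewrite the pairs into JSON before parsing. A hedged sketch of mine, assuming keys are word characters and values never contain ',' or '}' (written as plain Java; the same replaceAll call works verbatim in Groovy):

public class ToJson {
    public static void main(String[] args) {
        String s = "[{name=John, dob=1970-07-27 00:00:00.0, score=81}, "
                + "{name=Jane, dob=1970-07-28 00:00:00.0, score=77}]";
        // quote each key=value pair as "key":"value";
        // assumes values contain no ',' or '}'
        String json = s.replaceAll("(\\w+)=([^,}]*)", "\"$1\":\"$2\"");
        System.out.println(json);
        // [{"name":"John", "dob":"1970-07-27 00:00:00.0", "score":"81"}, ...]
    }
}

The resulting string parses with JsonSlurper as in the snippet above; note that score comes out as a string rather than a number.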

Get Palindrome of a String with replacement

Elina has a string S, consisting of lowercase English alphabetic letters (i.e. a-z). She can replace any character in the string with any other character, and she can perform this replacement any number of times. She wants to create a palindromic string p from S such that p contains the substring linkedin. It is guaranteed that Elina can create a palindromic string p from S.
Find the minimum number of operations required to create the palindromic string p from S.
Sample test cases:
First test case: S="linkedininininin"
Explanation: the first 8 characters already spell "linkedin"; for the whole string to be a palindrome, the remaining 8 characters "inininin" must become "nideknil" (the reverse of "linkedin"). Six of those positions differ (the 'n' and 'i' in the sixth and seventh positions already match), so
p = "linkedinnideknil"
output is 6
Second test case: S="fulrokxeuolnzxltiiniabudyyozvulqbydmaldbxaddmkobhlplkaplgndnksqidkaenxdacqtsskdkdddls"
output is 46
Here I was unable to get the second test case's output; how does it come out to 46?
Third test case:
S="linkaeiouideknil"
P="linkedinnideknil"
Output = 4
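For the test cases where the target p is given explicitly, the expected output is just the number of positions at which S and p differ, since each differing position costs exactly one replacement. A quick sanity-check sketch of my own (it verifies the first and third samples, it is not the solution itself):

public class DiffCount {
    // counts the positions where s and p hold different characters
    static int diff(String s, String p) {
        int d = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != p.charAt(i)) d++;
        }
        return d;
    }

    public static void main(String[] args) {
        System.out.println(diff("linkedininininin", "linkedinnideknil")); // 6
        System.out.println(diff("linkaeiouideknil", "linkedinnideknil")); // 4
    }
}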
Here is code with time complexity of O(n).
import java.io.*;
import java.util.*;

class TestClass {
    public static void main(String args[]) throws Exception {
        Scanner sc = new Scanner(System.in);
        String input = sc.next();
        String ln = "linkedin";
        String rln = "nideknil"; // "linkedin" reversed, for a mirrored match
        int limit, limit2;
        int len = input.length();
        if (len % 2 == 0) {
            limit = len / 2 - 7;
            limit2 = len / 2 - 1;
        } else {
            limit = (len + 1) / 2 - 7;
            limit2 = (len - 1) / 2 - 1;
        }
        // find the 8-character window that already agrees with "linkedin"
        // (or its reverse) in the most positions; windows that straddle the
        // middle are skipped by jumping from limit to the second half
        int max = 0, index = 0;
        boolean rev = false;
        for (int i = 0; i <= len - 8; i++) {
            int count1 = 0, count2 = 0;
            if (i == limit) {
                if (len % 2 == 0) {
                    i = len / 2;
                } else {
                    i = (len - 1) / 2;
                }
            }
            String temp = input.substring(i, i + 8);
            for (int j = 0; j < 8; j++) {
                if (ln.charAt(j) == temp.charAt(j)) {
                    count1++;
                }
                if (rln.charAt(j) == temp.charAt(j)) {
                    count2++;
                }
                int temp2 = count1 > count2 ? count1 : count2;
                if (temp2 > max) {
                    index = i;
                    max = temp2;
                    rev = (temp2 == count2);
                }
            }
        }
        // write "linkedin" (or its reverse) into the chosen window,
        // counting the characters that have to change
        int replace = 0;
        char in[] = input.toCharArray();
        int i, j;
        for (i = index, j = 0; i < index + 8; j++, i++) {
            if (rev) {
                if (rln.charAt(j) != in[i]) {
                    replace++;
                    in[i] = rln.charAt(j);
                }
            } else {
                if (ln.charAt(j) != in[i]) {
                    replace++;
                    in[i] = ln.charAt(j);
                }
            }
        }
        // mirror the halves to make the whole string a palindrome,
        // counting one replacement per mismatched pair
        for (j = 0, i = len - 1; j <= limit2; i--, j++) {
            if (in[i] != in[j]) {
                replace++;
            }
        }
        System.out.println(replace);
    }
}

ANTLR4 lexer semantic predicate issue

I'm trying to use a semantic predicate in the lexer to look ahead one token but somehow I can't get it right. Here's what I have:
lexer grammar
lexer grammar TLLexer;

DirStart
    : { getCharPositionInLine() == 0 }? '#dir'
    ;

DirEnd
    : { getCharPositionInLine() == 0 }? '#end'
    ;

Cont
    : 'contents' [ \t]* -> mode(CNT)
    ;

WS
    : [ \t]+ -> channel(HIDDEN)
    ;

NL
    : '\r'? '\n'
    ;

mode CNT;

CNT_DirEnd
    : '#end' [ \t]* '\n'?
      { System.out.println("--matched end--"); }
    ;

CNT_LastLine
    : ~ '\n'* '\n'
      { _input.LA(1) == CNT_DirEnd }? -> mode(DEFAULT_MODE)
    ;

CNT_Line
    : ~ '\n'* '\n'
    ;
parser grammar
parser grammar TLParser;

options { tokenVocab = TLLexer; }

dirs
    : ( dir
      | NL
      )*
    ;

dir
    : DirStart Cont
      contents
      DirEnd
    ;

contents
    : CNT_Line* CNT_LastLine
    ;
Essentially, each line of the content in the CNT mode is free-form, but it never begins with #end followed by optional whitespace. Basically, I want to keep matching the #end tag in the default lexer mode.
My test input is as follows:
#dir contents
..line..
#end
If I run this in grun I get the following:
$ grun TL dirs test.txt
--matched end--
line 3:0 extraneous input '#end\n' expecting {CNT_LastLine, CNT_Line}
So clearly CNT_DirEnd gets matched, but somehow the predicate doesn't detect it.
I know that this particular task doesn't require a semantic predicate, but that's just the part that doesn't work. The actual parser, while it may be written without the predicate, will be a lot less clean if I simply move the matching of the #end tag into the mode CNT.
Thanks,
Kesha.
I think I figured it out. The member _input represents the characters of the original input, so _input.LA returns characters, not lexer token IDs (is that the correct term?). Either way, the numbers the lexer returns to the parser have nothing to do with the values returned by _input.LA, hence the predicate fails unless, by some weird luck, the character value returned by _input.LA(1) happens to equal the lexer ID of CNT_DirEnd.
I modified the lexer as shown below and now it works, even though it is not as elegant as I hoped it would be (maybe someone knows a better way?)
lexer grammar TLLexer;

@lexer::members {
    private static final String END_DIR = "#end";

    private boolean isAtEndDir() {
        StringBuilder sb = new StringBuilder();
        int n = 1;
        int ic;
        // read characters until EOF
        while ((ic = _input.LA(n++)) != -1) {
            char c = (char) ic;
            // we're interested in the next line only
            if (c == '\n') break;
            if (c == '\r') continue;
            sb.append(c);
        }
        // does the line begin with #end ?
        if (sb.indexOf(END_DIR) != 0) return false;
        // is the #end followed by whitespace only?
        for (int i = END_DIR.length(); i < sb.length(); i++) {
            switch (sb.charAt(i)) {
                case ' ':
                case '\t':
                    continue;
                default:
                    return false;
            }
        }
        return true;
    }
}
[skipped .. nothing changed in the default mode]
mode CNT;

/* removed CNT_DirEnd */

CNT_LastLine
    : ~ '\n'* '\n'
      { isAtEndDir() }? -> mode(DEFAULT_MODE)
    ;

CNT_Line
    : ~ '\n'* '\n'
    ;

Does D have anything like Java's Scanner?

Is there a stream parser in D like Java's Scanner, where you can just call nextInt() to fetch an int, nextLong() for a long, etc.?
std.conv.parse is similar:
http://dlang.org/phobos/std_conv.html#parse
The example is a string, though it is also possible to use it with other character sources.
import std.conv;
import std.stdio;
void main() {
    // a char source from the user
    auto source = LockingTextReader(stdin);
    int a = parse!int(source); // use it with parse
    writeln("you wrote ", a);
    // munch only works on actual strings, so we have to advance
    // past the whitespace manually
    for (; !source.empty; source.popFront()) {
        auto ch = source.front;
        if (ch != ' ' && ch != '\n')
            break;
    }
    int b = parse!int(source);
    writeln("then you wrote ", b);
}
$ ./test56
12 23
you wrote 12
then you wrote 23
