ANTLR4: catch an entire line of arbitrary data

I have a grammar with command lines starting with a / and "data lines" which is everything that does not start with a slash.
I just can't get it to be parsed correctly, the following rule
FM_DATA: ( ('\r' | '\n' | '\r\n') ~'/') -> mode(DATA_MODE);
does almost what I need but for a data line of
abcde
the following tokens are generated
[#23,170:171='\na',<4>,4:72]
[#24,172:175='bcde',<103>,5:1]
so the first character is swallowed by the rule.
I also tried
FM_DATA: ( {getCharPositionInLine() == 0}? ~'/') -> mode(DATA_MODE);
but this causes even weirder things.
What's the correct rule for getting this to work as expected?
TIA - Alex

The ... -> more command can be used so that the first char (or the first part of a lexer rule) is not emitted as a token on its own (yet): the matched text is carried over into the next token.
A quick demo:
lexer grammar FmDataLexer;
NewLine
: [\r\n]+ -> skip
;
CommandStart
: '/' -> pushMode(CommandMode)
;
FmDataStart
: . -> more, pushMode(FmDataMode)
;
mode CommandMode;
CommandLine
: ~[\r\n]+ -> popMode
;
mode FmDataMode;
FmData
: ~[\r\n]+ -> popMode
;
If you run the following code:
FmDataLexer lexer = new FmDataLexer(CharStreams.fromString("abcde\n/mu"));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();
for (Token t : stream.getTokens()) {
    System.out.printf("%-20s '%s'\n", FmDataLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
you'll get this output:
FmData 'abcde'
CommandStart '/'
CommandLine 'mu'
EOF '<EOF>'
See: https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md#mode-pushmode-popmode-and-more

Related

ANTLR4 handling continuations for "any data"

The grammar I need to create is based on the following:
Command lines start with a slash
Command lines can be continued with a hyphen as the last character
(excluding whitespaces) on a line
For some commands I want to parse their parameters
For other commands I am not interested in their parameters
This works almost fine with the following (simplified) Lexer
lexer grammar T1Lexer;
NewLine
: [\r\n]+ -> skip
;
CommandStart
: '/' -> pushMode(CommandMode)
;
DataStart
: . -> more, pushMode(DataMode)
;
mode DataMode;
DataLine
: ~[\r\n]+ -> popMode
;
mode CommandMode;
CmNL
: [\r\n]+ -> skip, popMode
;
CONTINUEMINUS : ( '-' [ ]* ('\r/' | '\n/' | '\r\n/') ) -> channel(HIDDEN);
EOL: ( [ ]* ('\r' | '\n' | '\r\n') ) -> popMode;
SPACE : [ \t\r\n]+ -> channel(HIDDEN) ;
DOT : [.] ;
COMMA : ',' ;
CMD1 : 'CMD1';
CMD2 : 'CMD2';
CMDIGN : 'CMDIGN' -> pushMode(DataMode) ;
VAR1 : 'VAR1=' ;
ID : ID_LITERAL;
fragment ID_LITERAL: [A-Z_$0-9]*?[A-Z_$]+?[A-Z_$0-9]*;
and Parser:
parser grammar T1Parser;
options { tokenVocab=T1Lexer; }
root : line+ EOF ;
line: ( commandLine | dataLine)+ ;
dataLine : DataLine ;
commandLine : CommandStart command ;
command : cmd1 | cmd2 | cmdign ;
cmd1 : CMD1 (VAR1 ID)+ ;
cmd2 : CMD2 (VAR1 ID)+ ;
cmdign : CMDIGN DataLine ;
The problem arises where I need a combination of 2. and 4., i.e. continuation for a command whose parameters I simply want as an unparsed string (lines 5 and 6 in the example).
When I push to DataMode for CMDIGN on line 5, the continuation character is not recognized: it is swallowed by the "any until EOL" rule, so I pop back to the default mode, and the continuation line is treated as a new command and fails to parse.
Is there a way of handling this combo properly ?
TIA - Alex
(For your example) You don't really need a CommandMode; it actually complicates things a bit.
T1Lexer.g4:
lexer grammar T1Lexer
;
CMD_START: '/';
CONTINUE_EOL_SLASH: '-' EOL_F '/' -> channel(HIDDEN);
EOL: EOL_F;
WS: [ \t]+ -> channel(HIDDEN);
DOT: [.];
COMMA: ',';
CMD1: 'CMD1';
CMD2: 'CMD2';
CMDIGN: 'CMDIGN' -> pushMode(DataMode);
VAR1: 'VAR1=';
ID: ID_LITERAL;
//=======================================
mode DataMode
;
DM_EOL: EOL_F -> type(EOL), popMode;
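// any number of segments ending in '-' + EOL (line continuations), then the final segment: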
DATA_LINE: ( ~[\r\n]*? '-' EOL_F)* ~[\r\n]+;
//=======================================
fragment NL: '\r'? '\n';
fragment EOL_F: [ ]* NL;
fragment ID_LITERAL: [A-Z_$0-9]*? [A-Z_$]+? [A-Z_$0-9]*;
T1Parser.g4
parser grammar T1Parser
;
options {
tokenVocab = T1Lexer;
}
root: line (EOL line)* EOL? EOF;
line: commandLine | dataLine | emptyLine;
dataLine: DATA_LINE;
commandLine: CMD_START command;
emptyLine: CMD_START;
command: cmd1 | cmd2 | cmdign;
cmd1: CMD1 (VAR1 ID)+;
cmd2: CMD2 (VAR1 ID)+;
cmdign: CMDIGN DATA_LINE?;
Test Input:
/ CMD1 VAR1=VAL1 VAR1=VAL2
/ CMDIGN VAR1=BLAH VAR2=BLAH
/ CMD2 VAR1=VAL12 -
/ VAR1=VAL22
/ CMDIGN
/
/ CMDIGN VAR-1=0 -
/ VAR2=notignored
Token Stream:
[#0,0:0='/',<'/'>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:5='CMD1',<'CMD1'>,1:2]
[#3,6:6=' ',<WS>,channel=1,1:6]
[#4,7:11='VAR1=',<'VAR1='>,1:7]
[#5,12:15='VAL1',<ID>,1:12]
[#6,16:16=' ',<WS>,channel=1,1:16]
[#7,17:21='VAR1=',<'VAR1='>,1:17]
[#8,22:25='VAL2',<ID>,1:22]
[#9,26:26='\n',<EOL>,1:26]
[#10,27:27='/',<'/'>,2:0]
[#11,28:28=' ',<WS>,channel=1,2:1]
[#12,29:34='CMDIGN',<'CMDIGN'>,2:2]
[#13,35:54=' VAR1=BLAH VAR2=BLAH',<DATA_LINE>,2:8]
[#14,55:55='\n',<EOL>,2:28]
[#15,56:56='/',<'/'>,3:0]
[#16,57:57=' ',<WS>,channel=1,3:1]
[#17,58:61='CMD2',<'CMD2'>,3:2]
[#18,62:62=' ',<WS>,channel=1,3:6]
[#19,63:67='VAR1=',<'VAR1='>,3:7]
[#20,68:72='VAL12',<ID>,3:12]
[#21,73:73=' ',<WS>,channel=1,3:17]
[#22,74:76='-\n/',<CONTINUE_EOL_SLASH>,channel=1,3:18]
[#23,77:82=' ',<WS>,channel=1,4:1]
[#24,83:87='VAR1=',<'VAR1='>,4:7]
[#25,88:92='VAL22',<ID>,4:12]
[#26,93:93='\n',<EOL>,4:17]
[#27,94:94='/',<'/'>,5:0]
[#28,95:95=' ',<WS>,channel=1,5:1]
[#29,96:101='CMDIGN',<'CMDIGN'>,5:2]
[#30,102:102='\n',<EOL>,5:8]
[#31,103:103='/',<'/'>,6:0]
[#32,104:104='\n',<EOL>,6:1]
[#33,105:105='/',<'/'>,7:0]
[#34,106:106=' ',<WS>,channel=1,7:1]
[#35,107:112='CMDIGN',<'CMDIGN'>,7:2]
[#36,113:150=' VAR-1=0 - \n/
tree output:
(root
(line
(commandLine
/
(command
(cmd1 CMD1 VAR1= VAL1 VAR1= VAL2)
)
)
)
\n
(line
(commandLine
/
(command
(cmdign CMDIGN VAR1=BLAH VAR2=BLAH)
)
)
)
\n
(line
(commandLine
/
(command
(cmd2 CMD2 VAR1= VAL12 VAR1= VAL22)
)
)
)
\n
(line
(commandLine
/
(command
(cmdign CMDIGN)
)
)
)
\n
(line
(emptyLine /)
)
\n
(line
(commandLine
/
(command
(cmdign CMDIGN VAR-1=0 - \n/ VAR2=notignored)
)
)
)
<EOF>
)

How to parse an expression with parentheses?

I would like to parse an expression with parentheses in Python using textX.
For example, the following DSL:
CREATE boby = sacha - ( boby & tralaa) ; 
CREATE boby = sacha & boby - ( david & lucas )
This is the grammar I tried:
Model:
    'CREATE' name=Identifier '=' exp=SetExpr
;
JoinOperator: /-/&/;
SetExpr:SetParExpr | SetBaseExpr 
;
SetBaseExpr:
    first=ID op=JoinOperator second=ID
;
SetParExpr:
    '(' SetExpr ')'
I guess I should have a list somewhere to fill with expressions.
Do you have any suggestions?
I've changed your examples just slightly: I added a semicolon to the end, and I put another pair of parentheses in your second example. I inferred these changes from what you provided in your grammar. Here are the examples:
CREATE boby = sacha - ( boby & tralaa);
CREATE boby = sacha & (boby - ( david & lucas ));
To parse examples like these your grammar needs to be changed to:
Take in multiple Models (I created a Script rule that takes semicolon-separated models)
Allow the second property of the SetBaseExpr rule to be an ID or a SetParExpr.
Change Identifier to ID in the model rule (I assume this is what you meant).
I made these changes and ended up with the following grammar that parses the examples I gave:
Script:
models+=Model[';'] ';'
;
Model:
'CREATE' name=ID '=' exp=SetExpr
;
JoinOperator: '-' | '&';
SetExpr:
SetParExpr | SetBaseExpr
;
SetBaseExpr:
first=ID op=JoinOperator (second=ID | second=SetParExpr)
;
SetParExpr:
'(' SetExpr ')'
;
I hope that answers your question or gives you a hint as to how to handle parenthetical expressions.
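In case it's useful, here is a minimal, untested sketch of driving that grammar from the textX Python API (metamodel_from_str and model_from_str are the standard textX entry points; the printed attributes just illustrate what the parsed model exposes):
from textx import metamodel_from_str

GRAMMAR = '''
Script: models+=Model[';'] ';' ;
Model: 'CREATE' name=ID '=' exp=SetExpr ;
JoinOperator: '-' | '&';
SetExpr: SetParExpr | SetBaseExpr ;
SetBaseExpr: first=ID op=JoinOperator (second=ID | second=SetParExpr) ;
SetParExpr: '(' SetExpr ')' ;
'''

mm = metamodel_from_str(GRAMMAR)
script = mm.model_from_str("CREATE boby = sacha - ( boby & tralaa);")
model = script.models[0]
print(model.name, model.exp.first, model.exp.op)  # boby sacha -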

how to detect invalid utf8 unicode/binary in a text file

I need to detect a corrupted text file where there are invalid (non-ASCII) UTF-8, Unicode or binary characters, for example:
�>t�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½o��������ï¿ï¿½_��������������������o����������������������￿����ß����������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~�ï¿ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½}���������}w��׿��������������������������������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½~������������������������������������_������������������������������������������������������������������������������^����ï¿ï¿½s�����������������������������?�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½w�������������ï¿ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½}����������ï¿ï¿½ï¿½ï¿½ï¿½y����������������ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½o�������������������������}��
what I have tried:
iconv -f utf-8 -t utf-8 -c file.csv
This converts a file from UTF-8 encoding to UTF-8 encoding, and -c is for skipping invalid UTF-8 characters. However, in the end those illegal characters still get printed. Are there any other solutions in bash on Linux or in other languages?
Assuming you have your locale set to UTF-8 (see locale output), this works well to recognize invalid UTF-8 sequences:
grep -axv '.*' file.txt
Explanation (from grep man page):
-a, --text: treat the file as text; this essentially prevents grep from aborting once it finds an invalid byte sequence (i.e. one that is not valid UTF-8)
-v, --invert-match: invert the match, showing the lines that are not matched
-x '.*' (--line-regexp): match a complete line consisting entirely of valid UTF-8 characters
Hence, the output will be exactly the lines that contain an invalid (non-UTF-8) byte sequence (because of the inverted match, -v).
I would grep for non-ASCII characters.
With GNU grep with PCRE support (needed for -P, which is not always available; on FreeBSD you can use pcregrep from the pcre2 package) you can do:
grep -P "[\x80-\xFF]" file
Reference: How Do I grep For all non-ASCII Characters in UNIX. So, in fact, if you only want to check whether the file contains non-ASCII characters, you can just say:
if grep -qP "[\x80-\xFF]" file ; then echo "file contains non-ASCII characters"; fi
# ^
# silent grep
To remove these characters, you can use:
sed -i.bak 's/[\d128-\d255]//g' file
This will create a file.bak file as backup, whereas the original file will have its non-ASCII characters removed. Reference: Remove non-ascii characters from csv.
Try this in order to find non-ASCII characters from the shell.
Command:
$ perl -ne 'print "$. $_" if m/[\x80-\xFF]/' utf8.txt
Output:
2 Pour être ou ne pas être
4 Byť či nebyť
5 是或不
What you are looking at is by definition corrupted. Apparently, you are displaying the file as it is rendered in Latin-1; the three characters ï¿½ represent the three byte values 0xEF 0xBF 0xBD. But those are the UTF-8 encoding of the Unicode REPLACEMENT CHARACTER U+FFFD, which is the result of attempting to convert bytes from an unknown or undefined encoding into UTF-8, and which would properly be displayed as � (if you have a browser from this century, you should see something like a black diamond with a question mark in it; but this also depends on the font you are using, etc.).
So your question about "how to detect" this particular phenomenon is easy; the Unicode code point U+FFFD is a dead giveaway, and the only possible symptom from the process you are implying.
These are not "invalid Unicode" or "invalid UTF-8" in the sense that this is a valid UTF-8 sequence which encodes a valid Unicode code point; it's just that the semantics of this particular code point is "this is a replacement character for a character which could not be represented properly", i.e. invalid input.
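You can see this concretely in a Python shell: the three bytes decode to exactly one code point, U+FFFD:
>>> hex(ord(b'\xef\xbf\xbd'.decode('utf-8')))
'0xfffd'
>>> import unicodedata; unicodedata.name('\ufffd')
'REPLACEMENT CHARACTER'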
As for how to prevent it in the first place, the answer is really simple, but also rather uninformative -- you need to identify when and how the incorrect encoding took place, and fix the process which produced this invalid output.
To just remove the U+FFFD characters, try something like
perl -CSD -pe 's/\x{FFFD}//g' file
but again, the proper solution is to not generate these erroneous outputs in the first place.
To actually answer the question about how to remove only invalid code points, try
iconv -f UTF-8 -t UTF-8//IGNORE broken-utf8.txt >fixed-utf8.txt
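If you'd rather do this in a small script than with iconv, a rough Python equivalent of the //IGNORE approach (the file names are placeholders) is:
with open("broken-utf8.txt", "rb") as f:
    data = f.read()
# decoding with errors="ignore" silently drops invalid byte sequences
clean = data.decode("utf-8", errors="ignore")
with open("fixed-utf8.txt", "w", encoding="utf-8") as f:
    f.write(clean)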
(You are not revealing the encoding of your example data. It is possible that it has an additional corruption. If what you are showing us is a copy/paste of the UTF-8 rendering of the data, it has been "double-encoded". In other words, somebody took -- already corrupted, as per the above -- UTF-8 text and told the computer to convert it from Latin-1 to UTF-8. Undoing that is easy; just convert it "back" to Latin-1. What you obtain should then be the original UTF-8 data before the superfluous incorrect conversion.
iconv -f utf-8 -t latin-1 mojibake-utf8.txt >fixed-utf8.txt
See also mojibake.)
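That "undo" step, sketched in Python under the same assumption that the file is double-encoded:
with open("mojibake-utf8.txt", encoding="utf-8") as f:
    text = f.read()
# re-encode to Latin-1 to recover the original bytes, then decode those bytes as UTF-8
original = text.encode("latin-1").decode("utf-8")
with open("fixed-utf8.txt", "w", encoding="utf-8") as f:
    f.write(original)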
This Perl program should remove all non-ASCII characters:
foreach $file (@ARGV) {
    open(IN, $file);
    open(OUT, "> super-temporary-utf8-replacement-file-which-should-never-be-used-EVER");
    while (<IN>) {
        s/[^[:ascii:]]//g;
        print OUT "$_";
    }
    rename "super-temporary-utf8-replacement-file-which-should-never-be-used-EVER", $file;
}
What this does is it takes files as input on the command-line, like so: perl fixutf8.pl foo bar baz
Then, for each line, it replaces each instance of a non-ASCII character with nothing (deletion).
It then writes this modified line out to super-temporary-utf8-replacement-file-which-should-never-be-used-EVER (named so that it doesn't clobber any other file).
Afterwards, it renames the temporary file to that of the original one.
This accepts ALL ASCII characters (including DEL, NUL, CR, etc.), in case you have some special use for them. If you want only printable characters, simply replace :ascii: with :print: in s///.
I hope this helps! Please let me know if this wasn't what you were looking for.
The following C program detects invalid utf8 characters.
It was tested and used on a linux system.
/*
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
#include <stdio.h>
#include <stdlib.h>
void usage( void ) {
    printf( "Usage: test_utf8 file ...\n" );
    return;
}

int line_number = 1;
int char_number = 1;
char *file_name = NULL;

void inv_char( void ) {
    printf( "%s: line : %d - char %d\n", file_name, line_number, char_number );
    return;
}

int main( int argc, char *argv[]) {
    FILE *out = NULL;
    FILE *fh = NULL;

    // printf( "argc: %d\n", argc );
    if( argc < 2 ) {
        usage();
        exit( 1 );
    }

    // printf( "File: %s\n", argv[1] );
    file_name = argv[1];
    fh = fopen( file_name, "rb" );
    if( ! fh ) {
        printf( "Could not open file '%s'\n", file_name );
        exit( 1 );
    }

    int utf8_type = 1;  // 1 = expecting a lead byte; 2..4 = position of the next continuation byte
    int utf8_1 = 0;
    int utf8_2 = 0;
    int utf8_3 = 0;
    int utf8_4 = 0;
    int byte_count = 0;
    int expected_byte_count = 0;

    int cin = fgetc( fh );
    while( ! feof( fh ) ) {
        switch( utf8_type ) {
        case 1:
            if( (cin & 0x80) ) {
                // high bit set: must be the lead byte of a multi-byte sequence
                if( (cin & 0xe0) == 0xc0 ) {  // 110xxxxx -> 2-byte sequence
                    utf8_1 = cin;
                    utf8_type = 2;
                    byte_count = 1;
                    expected_byte_count = 2;
                    break;
                }
                if( (cin & 0xf0) == 0xe0 ) {  // 1110xxxx -> 3-byte sequence
                    utf8_1 = cin;
                    utf8_type = 2;
                    byte_count = 1;
                    expected_byte_count = 3;
                    break;
                }
                if( (cin & 0xf8) == 0xf0 ) {  // 11110xxx -> 4-byte sequence
                    utf8_1 = cin;
                    utf8_type = 2;
                    byte_count = 1;
                    expected_byte_count = 4;
                    break;
                }
                inv_char();  // any other byte with the high bit set is an invalid lead byte
                utf8_type = 1;
                break;
            }
            break;
        case 2:
        case 3:
        case 4:
            // printf( "utf8_type - %d\n", utf8_type );
            // printf( "%c - %02x\n", cin, cin );
            if( (cin & 0xc0) == 0x80 ) {  // 10xxxxxx: a valid continuation byte
                if( utf8_type == expected_byte_count ) {
                    utf8_type = 1;  // sequence complete
                    break;
                }
                byte_count = utf8_type;
                utf8_type++;
                if( utf8_type == 5 ) {
                    utf8_type = 1;
                }
                break;
            }
            inv_char();  // expected a continuation byte but got something else
            utf8_type = 1;
            break;
        default:
            inv_char();
            utf8_type = 1;
            break;
        }

        if( cin == '\n' ) {
            line_number ++;
            char_number = 0;
        }
        if( out != NULL ) {
            fputc( cin, out );
        }
        // printf( "lno: %d\n", line_number );
        cin = fgetc( fh );
        char_number++;
    }
    fclose( fh );
    return 0;
}
... I'm trying to detect if a file has corrupted characters. I'm also
interested in deleting them.
This is easy with ugrep and takes just one line:
ugrep -q -e "." -N "\p{Unicode}" file.csv && echo "file is corrupted"
To remove invalid Unicode characters:
ugrep "\p{Unicode}" --format="%o" file.csv
The first command matches any character with -e "." except valid Unicode with -N "\p{Unicode}", which is a "negative pattern" to skip.
The second command matches a Unicode character "\p{Unicode}" and writes it with --format="%o".
I am probably repeating what others have said already. But I think your invalid characters still get printed because they may be valid. The Universal Character Set is an attempt to reference the worldwide frequently used characters in order to be able to write robust software which does not rely on a special character set.
So I think your problem may be one of the following two - on the assumption that your overall goal is to handle this (malicious) input from UTF files in general:
There are invalid UTF-8 characters (better called invalid byte sequences - for this I'd like to refer to the corresponding Wikipedia article).
There are absent equivalents in your current display font, which are substituted by a special symbol or shown as their binary ASCII equivalent (see e.g. this SO post: UTF-8 special characters don't show up).
So in my opinion you have two possible ways to handle this:
Transform all characters from UTF-8 into something handleable - e.g. ASCII - this can be done e.g. with iconv -f utf-8 -t ascii -o file_in_ascii.txt file_in_utf8.txt. But be careful: transferring from the wider character space (UTF) into a smaller one might cause data loss.
Handle UTF(-8) correctly - this is how the world is writing stuff. If you think you have to rely on ASCII chars because of some limiting post-processing step, stop and rethink. In most cases the post-processor already supports UTF; it's probably better to find out how to utilize it. You're making your stuff future- and bullet-proof.
Handling UTF might seem tricky, but the following steps may help you accomplish UTF readiness:
Be able to display UTF correctly, or ensure that your display stack (OS, terminal and so on) is able to display an adequate subset of Unicode (which, of course, should meet your needs); this may prevent the need for a hex editor in many cases. Unfortunately UTF is too big to come in one font, but a good point to start at is this SO post: https://stackoverflow.com/questions/586503/complete-monospaced-unicode-font
Be able to filter invalid byte sequences. There are many ways to achieve that; this U&L post shows a variety of them: Filtering invalid utf8 - I want to especially point out the 4th answer, which suggests using uconv, which allows you to set a callback handler for invalid sequences.
Read a bit more about Unicode.
A very dirty solution in python 3
with open("cur.txt", "r", encoding="utf-8") as f:
    for line in f:
        for c in line:
            if ord(c) < 128:
                print(c, end="")
The output should be:
>two_o~}}w~_^s?w}yo}
Using Ubuntu 22.04, I get a more correct answer by using:
grep -axv -P '.*' file.txt
The original answer, without the -P, seems to give false positives for a lot of Asian characters, like:
<lei:LegalName xml:lang="ko">피씨에이생명보험주식회사</lei:LegalName>
<lei:LegalName xml:lang="ko">린드먼 부품소재 전문투자조합 1</lei:LegalName>
<lei:LegalName xml:lang="ko">비엔피파리바 카디프손해보험 주식회사</lei:LegalName>
These characters do pass the scanning of the isutf8 utility.
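If all you need is a programmatic yes/no answer, a short Python check is one more option (strict decoding rejects any invalid byte sequence; note that U+FFFD itself is valid UTF-8, as discussed above):
def is_valid_utf8(path):
    """Return True if the file decodes cleanly as strict UTF-8."""
    with open(path, "rb") as f:
        try:
            f.read().decode("utf-8")
        except UnicodeDecodeError:
            return False
    return True

print(is_valid_utf8("file.txt"))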

antlr4 mismatch input error on sql parser

I am getting following error on parsing but not sure why it's happening.
line 1:24 mismatched input '1' expecting NUM
line 1:24 mismatched input '1' expecting NUM
select a from abc limit 1 ;
--
grammar SQLCmd;
parse : sql
;
sql : ('select' ((columns (',' columns))|count) 'from')
tables
('where' condition ((and|or) condition))* (limit)? ';'
;
limit : 'limit' NUM
;
num : NUM
;
count : 'count(*)'
;
columns : VAL
;
tables : VAL
;
condition : ( left '=' right )+
;
and : 'and'
;
or : 'or'
;
left : VAL
;
right : VAL
;
VAL : [*a-z0-9A-Z~?]+
;
NUM : [0-9]+
;
WS : [ \t\n\r]+ -> skip
;
It looks like you have a VAL instead of a NUM.
The "1" is both a VAL and a NUM but since VAL comes first, there will never be NUM tokens since every NUM will be a VAL.
Try putting the NUM rule before the VAL rule.
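For example (just swapping the order of the two rules in your grammar):
NUM : [0-9]+
;
VAL : [*a-z0-9A-Z~?]+
;
Note that a value consisting only of digits will then always be a NUM, never a VAL; if that matters, you will have to allow NUM in the parser wherever such a VAL can appear.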
You could have found this out yourself by looking at the token types coming from the lexer; that will tell you the actual type of each token.
@TheAntlrGuy: Maybe one could add the actual token type to the error message?

Why does AntlrWorks 2 display warning 125 (implicit definition of token in parser) in this case?

I have a separate lexer and parser grammar (derived from the sample ModeTagsLexer/ModeTagsParser) and get a warning in AntlrWorks 2 that I don't understand:
warning(125): implicit definition of token OPEN in parser
If I replace the OPEN rule with '<' the warning goes away. I wonder what the difference between OPEN and CLOSE is, since CLOSE gets no warning.
I'm using antlr-4.1-complete.jar and 2013-01-22-antlrworks-2.0.
Lexer STLexer.g4:
lexer grammar STLexer;
// Default mode rules (the SEA)
OPEN : '<' -> pushMode(ISLAND) ; // switch to ISLAND mode
TEXT : ~'<'+ ; // clump all text together
mode ISLAND;
CLOSE : '>' -> popMode ; // back to SEA mode
SLASH : '/' ;
ID : [a-zA-Z0-9"=]+ ; // match/send ID in tag to parser
WS : [ \t]+ -> channel(HIDDEN);
Parser STParser.g4:
parser grammar STParser;
options { tokenVocab=STLexer; } // use tokens from STLexer.g4
unit: (tag | TEXT)* ;
tag : OPEN ID+ CLOSE
| OPEN SLASH ID+ CLOSE
;
It even persists if I rename the rule slightly and remove the additional mode:
Lexer (modified):
lexer grammar STLexer;
// Default mode rules (the SEA)
OPPEN : '<' ;// -> pushMode(ISLAND) ; // switch to ISLAND mode
TEXT : ~'<'+ ; // clump all text together
//mode ISLAND;
CLOSE : '>' ; // -> popMode ; // back to SEA mode
SLASH : '/' ;
ID : [a-zA-Z0-9"=]+ ; // match/send ID in tag to parser
WS : [ \t]+ -> channel(HIDDEN);
Parser (modified):
parser grammar STParser;
options { tokenVocab=STLexer; } // use tokens from STLexer.g4
unit: (tag | TEXT)* ;
tag : ID OPPEN ID+ CLOSE
| ID OPPEN SLASH ID+ CLOSE
;
