ANTLR4 handling continuations for "any data" - antlr4

The grammar I need to create is based on the following:
1. Command lines start with a slash.
2. Command lines can be continued with a hyphen as the last character (excluding whitespace) on a line.
3. For some commands I want to parse their parameters.
4. For other commands I am not interested in their parameters.
This works almost fine with the following (simplified) lexer:
lexer grammar T1Lexer;
NewLine
: [\r\n]+ -> skip
;
CommandStart
: '/' -> pushMode(CommandMode)
;
DataStart
: . -> more, pushMode(DataMode)
;
mode DataMode;
DataLine
: ~[\r\n]+ -> popMode
;
mode CommandMode;
CmNL
: [\r\n]+ -> skip, popMode
;
CONTINUEMINUS : ( '-' [ ]* ('\r/' | '\n/' | '\r\n/') ) -> channel(HIDDEN);
EOL: ( [ ]* ('\r' | '\n' | '\r\n') ) -> popMode;
SPACE : [ \t\r\n]+ -> channel(HIDDEN) ;
DOT : [.] ;
COMMA : ',' ;
CMD1 : 'CMD1';
CMD2 : 'CMD2';
CMDIGN : 'CMDIGN' -> pushMode(DataMode) ;
VAR1 : 'VAR1=' ;
ID : ID_LITERAL;
fragment ID_LITERAL: [A-Z_$0-9]*?[A-Z_$]+?[A-Z_$0-9]*;
and Parser:
parser grammar T1Parser;
options { tokenVocab=T1Lexer; }
root : line+ EOF ;
line: ( commandLine | dataLine)+ ;
dataLine : DataLine ;
commandLine : CommandStart command ;
command : cmd1 | cmd2 | cmdign ;
cmd1 : CMD1 (VAR1 ID)+ ;
cmd2 : CMD2 (VAR1 ID)+ ;
cmdign : CMDIGN DataLine ;
The problem arises when I need a combination of 2. and 4., i.e. a continuation for a command whose parameters I simply want as an unparsed string (lines 5+6 in the example).
When I push to DataMode for CMDIGN on line 5, the continuation character is not recognized because it is swallowed by the "any until EOL" rule. I then pop back to default mode, and the continuation line is considered a new command and fails to parse.
Is there a way of handling this combination properly?
TIA - Alex

(For your example) You don't really need a CommandMode; it actually complicates things a bit.
T1Lexer.g4:
lexer grammar T1Lexer
;
CMD_START: '/';
CONTINUE_EOL_SLASH: '-' EOL_F '/' -> channel(HIDDEN);
EOL: EOL_F;
WS: [ \t]+ -> channel(HIDDEN);
DOT: [.];
COMMA: ',';
CMD1: 'CMD1';
CMD2: 'CMD2';
CMDIGN: 'CMDIGN' -> pushMode(DataMode);
VAR1: 'VAR1=';
ID: ID_LITERAL;
//=======================================
mode DataMode
;
DM_EOL: EOL_F -> type(EOL), popMode;
DATA_LINE: ( ~[\r\n]*? '-' EOL_F)* ~[\r\n]+;
//=======================================
fragment NL: '\r'? '\n';
fragment EOL_F: [ ]* NL;
fragment ID_LITERAL: [A-Z_$0-9]*? [A-Z_$]+? [A-Z_$0-9]*;
T1Parser.g4:
parser grammar T1Parser
;
options {
tokenVocab = T1Lexer;
}
root: line (EOL line)* EOL? EOF;
line: commandLine | dataLine | emptyLine;
dataLine: DATA_LINE;
commandLine: CMD_START command;
emptyLine: CMD_START;
command: cmd1 | cmd2 | cmdign;
cmd1: CMD1 (VAR1 ID)+;
cmd2: CMD2 (VAR1 ID)+;
cmdign: CMDIGN DATA_LINE?;
Test Input:
/ CMD1 VAR1=VAL1 VAR1=VAL2
/ CMDIGN VAR1=BLAH VAR2=BLAH
/ CMD2 VAR1=VAL12 -
/ VAR1=VAL22
/ CMDIGN
/
/ CMDIGN VAR-1=0 -
/ VAR2=notignored
Token Stream:
[#0,0:0='/',<'/'>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:5='CMD1',<'CMD1'>,1:2]
[#3,6:6=' ',<WS>,channel=1,1:6]
[#4,7:11='VAR1=',<'VAR1='>,1:7]
[#5,12:15='VAL1',<ID>,1:12]
[#6,16:16=' ',<WS>,channel=1,1:16]
[#7,17:21='VAR1=',<'VAR1='>,1:17]
[#8,22:25='VAL2',<ID>,1:22]
[#9,26:26='\n',<EOL>,1:26]
[#10,27:27='/',<'/'>,2:0]
[#11,28:28=' ',<WS>,channel=1,2:1]
[#12,29:34='CMDIGN',<'CMDIGN'>,2:2]
[#13,35:54=' VAR1=BLAH VAR2=BLAH',<DATA_LINE>,2:8]
[#14,55:55='\n',<EOL>,2:28]
[#15,56:56='/',<'/'>,3:0]
[#16,57:57=' ',<WS>,channel=1,3:1]
[#17,58:61='CMD2',<'CMD2'>,3:2]
[#18,62:62=' ',<WS>,channel=1,3:6]
[#19,63:67='VAR1=',<'VAR1='>,3:7]
[#20,68:72='VAL12',<ID>,3:12]
[#21,73:73=' ',<WS>,channel=1,3:17]
[#22,74:76='-\n/',<CONTINUE_EOL_SLASH>,channel=1,3:18]
[#23,77:82=' ',<WS>,channel=1,4:1]
[#24,83:87='VAR1=',<'VAR1='>,4:7]
[#25,88:92='VAL22',<ID>,4:12]
[#26,93:93='\n',<EOL>,4:17]
[#27,94:94='/',<'/'>,5:0]
[#28,95:95=' ',<WS>,channel=1,5:1]
[#29,96:101='CMDIGN',<'CMDIGN'>,5:2]
[#30,102:102='\n',<EOL>,5:8]
[#31,103:103='/',<'/'>,6:0]
[#32,104:104='\n',<EOL>,6:1]
[#33,105:105='/',<'/'>,7:0]
[#34,106:106=' ',<WS>,channel=1,7:1]
[#35,107:112='CMDIGN',<'CMDIGN'>,7:2]
[#36,113:150=' VAR-1=0 - \n/
tree output:
(root
(line
(commandLine
/
(command
(cmd1 CMD1 VAR1= VAL1 VAR1= VAL2)
)
)
)
\n
(line
(commandLine
/
(command
(cmdign CMDIGN VAR1=BLAH VAR2=BLAH)
)
)
)
\n
(line
(commandLine
/
(command
(cmd2 CMD2 VAR1= VAL12 VAR1= VAL22)
)
)
)
\n
(line
(commandLine
/
(command
(cmdign CMDIGN)
)
)
)
\n
(line
(emptyLine /)
)
\n
(line
(commandLine
/
(command
(cmdign CMDIGN VAR-1=0 - \n/ VAR2=notignored)
)
)
)
<EOF>
)

Related

ANTLR4 catch an entire line of arbitrary data

I have a grammar with command lines starting with a / and "data lines" which is everything that does not start with a slash.
I just can't get it to be parsed correctly, the following rule
FM_DATA: ( ('\r' | '\n' | '\r\n') ~'/') -> mode(DATA_MODE);
does almost what I need but for a data line of
abcde
the following tokens are generated
[#23,170:171='\na',<4>,4:72]
[#24,172:175='bcde',<103>,5:1]
so the first character is swallowed by the rule.
I also tried
FM_DATA: ( {getCharPositionInLine() == 0}? ~'/') -> mode(DATA_MODE);
but this causes even weirder things.
What's the correct rule for getting this to work as expected ?
TIA - Alex
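The ... -> more command can be used here: the lexer matches the first char (or first part of a lexer rule) but does not emit a token for it yet. The matched text is kept and becomes the start of the next token that is produced.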
A quick demo:
lexer grammar FmDataLexer;
NewLine
: [\r\n]+ -> skip
;
CommandStart
: '/' -> pushMode(CommandMode)
;
FmDataStart
: . -> more, pushMode(FmDataMode)
;
mode CommandMode;
CommandLine
: ~[\r\n]+ -> popMode
;
mode FmDataMode;
FmData
: ~[\r\n]+ -> popMode
;
If you run the following code:
FmDataLexer lexer = new FmDataLexer(CharStreams.fromString("abcde\n/mu"));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();
for (Token t : stream.getTokens()) {
System.out.printf("%-20s '%s'\n", FmDataLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
you'll get this output:
FmData 'abcde'
CommandStart '/'
CommandLine 'mu'
EOF '<EOF>'
See: https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md#mode-pushmode-popmode-and-more

How to format lines in a file by using shell language?

The purpose of the program is to make the comments in a file begin in the same column:
if a line begins with ; it is left unchanged;
if a line contains code followed by ; the program should insert spaces before the ; so that it starts in the same column as the farthest ;
For example:
Before:
; Also change "-f elf " for "-f elf64" in build command.
;
section .data ; section for initialized data
str: db 'Hello world!', 0Ah ; message string with new-line char
; at the end (10 decimal)
After:
; Also change "-f elf " for "-f elf64" in build command. # These two lines don't change
; # because they start with ;
section .data ; section for initialized data
str: db 'Hello world!', 0Ah ; message string with new-line char
; at the end (10 decimal)
I am a beginner in Linux and shell, so far I have got
echo "Enter the filename"
read name
cat $name | while read line;
do ....
Our teacher told us that we should use two while loops:
record the longest length before the ; in the first loop and make the changes in the second while loop.
For now I don't know how to use awk or sed to find the longest length before the ;
Any ideas?
Here is the solution, assuming that comments in your file begin with the first semi-colon (;) that is not inside a string:
$ cat tst.awk
BEGIN{ ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ }
{
nostrings = ""
tail = $0
while ( match(tail,/'[^']*'/) ) {
nostrings = nostrings substr(tail,1,RSTART-1) sprintf("%*s",RLENGTH,"")
tail = substr(tail,RSTART+RLENGTH)
}
nostrings = nostrings tail
cur = index(nostrings,";")
}
NR==FNR { max = (cur > max ? cur : max); next }
cur > 1 { $0 = sprintf("%-*s%s", max-1, substr($0,1,cur-1), substr($0,cur)) }
{ print }
$ awk -f tst.awk file
; Also change "-f elf " for "-f elf64" in build command.
;
section .data ; section for initialized data
str: db 'Hello; world!', 0Ah ; message string with new-line char
; at the end (10 decimal)
and below is how you get to it from a naive starting point (I added a semi-colon inside your Hello World! string for testing - make sure to verify all suggested solutions using that).
Note that the above DOES contain 2 loops on the input as your teacher suggests, but you do not need to manually write them as awk provides the loops for you each time it reads the file. If your input file contains tabs or similar then you need to remove them in advance, e.g. by using pr -e -t.
Here is how you get to the above:
If you cannot have semi-colons in other contexts than as the start of comments then all you need is:
$ cat tst.awk
{ cur = index($0,";") }
NR==FNR { max = (cur > max ? cur : max); next }
cur > 1 { $0 = sprintf("%-*s%s", max-1, substr($0,1,cur-1), substr($0,cur)) }
{ print }
which you'd execute as awk -f tst.awk file file (yes, specify your input file twice).
If your code can contain semi-colons in contexts that are not the start of a comment, e.g. in the middle of a string, then you need to tell us how to distinguish comment-start semi-colons from the others. If they can ONLY appear between single quotes in strings, e.g. the ; inside 'Hello; world!' below:
$ cat file
; Also change "-f elf " for "-f elf64" in build command.
;
section .data ; section for initialized data
str: db 'Hello; world!', 0Ah ; message string with new-line char
; at the end (10 decimal)
then this is all you need to replace every string with a series of blank chars before finding the first semi-colon (which is then presumably the start of a comment):
$ cat tst.awk
{
nostrings = ""
tail = $0
while ( match(tail,/'[^']*'/) ) {
nostrings = nostrings substr(tail,1,RSTART-1) sprintf("%*s",RLENGTH,"")
tail = substr(tail,RSTART+RLENGTH)
}
nostrings = nostrings tail
cur = index(nostrings,";")
}
...the rest as before...
and finally, if you don't want to specify the file name twice on the command line, just duplicate its name in the ARGV[] array by adding this line at the top:
BEGIN{ ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ }
There are a few printf tricks that make this a manageable project. Take a look at the following. The script formats the assembly file with the assembly code beginning at column 0 to code_width - 1 with the comments following at column code_width lined up after the code. The script is fairly well commented so you should be able to follow along.
The usage is:
bash nameofscript.sh input_file [code_width (default 46char)]
or if you make nameofscript.sh executable, then simply:
./nameofscript.sh input_file [code_width (default 46char)]
NOTE: this script requires Bash, if not run on bash, you may experience inconsistent results. If you have multiple embedded ; in each line, the first will be considered the beginning of a comment. Let me know if you have questions.
#!/bin/bash
## basic function to trim (or strip) the leading & trailing whitespace from a variable
# passed to the function. Usage: VAR=$(trimws "$VAR")
function trimws {
[ -z "$1" ] && return 1 # nothing to trim (a one-character string is still valid input)
local trimstr=$1
trimstr="${trimstr#"${trimstr%%[![:space:]]*}"}" # remove leading whitespace characters
trimstr="${trimstr%"${trimstr##*[![:space:]]}"}" # remove trailing whitespace characters
printf "%s" "$trimstr"
return 0
}
afn="$1" # input assembly filename
cwidth=${2:--46} # code field width (- is left justified)
[ "${cwidth:0:1}" = '-' ] || cwidth=-${cwidth} # make sure first char is '-'
[ -r "$afn" ] || { # validate input file is readable
printf "error: file not found: '%s'. Usage: %s <filename> [code_width (46 ch)]\n" "$afn" "${0##*/}"
exit 1
}
## loop through file splitting on ';'
while IFS=$';\n' read -r code comment || [ -n "$code$comment" ]; do
[ -n "$code" ] || { # if no '$code' comment only line
if [ -n "$comment" ]; then
printf ";%s\n" "$comment" # output the line unchanged
else
printf "\n" # it was a blank line to begin with
fi
continue # read next line
}
code=$(trimws "$code") # trim leading and trailing whitespace
comment=$(trimws "$comment") # same
printf "%*s ; %s\n" "$cwidth" "$code" "$comment" # output new format
done <"$afn"
exit 0
input:
$ cat dat/asmfile.txt
; Also change "-f elf " for "-f elf64" in build command.
;
section .data ; section for initialized data
str: db 'Hello world!', 0Ah ; message string with new-line char
; at the end (10 decimal)
output:
$ bash fmtasmcmt.sh dat/asmfile.txt
; Also change "-f elf " for "-f elf64" in build command.
;
section .data ; section for initialized data
str: db 'Hello world!', 0Ah ; message string with new-line char
; at the end (10 decimal)
So yeah, use a while loop to find the longest length, given your input in the local file input:
length=0
length2=0
while IFS= read -r -- i; do
(( ${#i} > length2 )) && length2=${#i}
i=${i/\;*/}
(( ${#i} > length )) && length=${#i}
done < ./input
(( length++ )); (( length2++ ))
In your next while loop, detect whether the line starts with ; using [[ ${i:0:1} = ';' ]] and output it, or format the output with awk using the length you determined: awk -F\; -v len=$length '{ printf "%-"len"s %-40s\n", $1, $2}'. Check here (http://www.unix.com/shell-programming-scripting/117543-formatting-output-columns.html) for more info on column formatting.
Edit: In case you didn't figure it out, the second loop looks like:
while IFS= read -r -- i; do
# echo the original if the line starts with ';'
[[ ${i:0:1} = ';' ]] && echo "$i" && continue
# column formatting with awk
(echo "$i" | grep -q ';') && echo "$i" | awk -v len=$length -v len2=$length2 -F\; '{printf "%-"len"s %-"len2"s\n",$1,";"$2}' || echo "$i"
done < ./input
That will give you what you want for the output.
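For comparison, the two-loop idea can also be written without awk or grep. This is only a sketch under a few assumptions: it needs Bash, it treats the first ; on a line as the comment start (so a ; inside a string will confuse it), it expects no tabs in the file, and the function name align_comments is made up for illustration.

```bash
#!/bin/bash
# align_comments FILE: pad the code part of each line so that every ';'
# comment starts in the same column. Hypothetical helper name.
align_comments() {
    local file=$1 line code max=0

    # Pass 1: longest code part among lines that have something before the ';'.
    while IFS= read -r line || [ -n "$line" ]; do
        [[ $line == *';'* && $line != ';'* ]] || continue
        code=${line%%;*}                      # everything before the first ';'
        if (( ${#code} > max )); then max=${#code}; fi
    done < "$file"

    # Pass 2: lines starting with ';' and lines without ';' are printed as-is;
    # all other lines get their code part padded to the width found in pass 1.
    while IFS= read -r line || [ -n "$line" ]; do
        if [[ $line == *';'* && $line != ';'* ]]; then
            code=${line%%;*}
            printf '%-*s%s\n' "$max" "$code" "${line:${#code}}"
        else
            printf '%s\n' "$line"
        fi
    done < "$file"
}
```

You would call it as align_comments ./input. A line that has only whitespace before its ; is padded as well, which lines up continuation comments with the others.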
I think I'm going to use this example for my personal formatting!
#!/usr/bin/perl -s -0
use strict;
our ($com); # command line option
$com = ";" unless defined $com ;
my $max=0;
$_= <>; # slurp file
while( /\n(.+?)$com/g ){
$max=length($1) if length($1) > $max }
s/\n(.+?)$com/sprintf("\n%-$max"."s$com",$1)/ge;
print $_; # print file
usage: align_coms input (after chmod+install)
Options: -com=... to redefine comments (default = ; )
and you can try align_coms -com=# align_coms to align this script's Perl comments :)
Edit 1:
Please see the (wise) comment of @EdMorton about problems when the input has strings (or similar) containing comment starters.
Edit 2: The following version can deal with 'alo; word' "alo; word". It is still not safe, since real languages always have some extra details (e.g. '...\'...', multiline comments), but it is a little more robust...
#!/usr/bin/perl -s -0
use strict;
our ($com); # command line option
$com = ";" unless defined $com ;
my $nc=qr{ # no comment regex
( '[^'\n]*' # '....'
| "[^"\n]*" # "...."
| . # common chars
)+?
}x;
my $max=0;
$_= <>; # slurp file
while( /\n($nc)$com/g ){
$max=length($1) if length($1) > $max }
s/\n($nc)$com/sprintf("\n%-$max"."s$com",$1)/ge;
print $_; # print file

Overlapping rules - mismatched input

My grammar (shown below, trimmed down from the original) requires somewhat overlapping rules:
grammar NOVIANum;
statement : (priorityStatement | integerStatement)* ;
priorityStatement : T_PRIO TwoDigits ;
integerStatement : T_INTEGER Integer ;
WS : [ \t\r\n]+ -> skip ;
T_PRIO : 'PRIO' ;
T_INTEGER : 'INTEGER' ;
Integer: OneToNine Digit* | ZERO ;
TwoDigits : Digit Digit ;
fragment OneToNine : ('1'..'9') ;
fragment Digit: ('0'..'9');
ZERO : [0] ;
so "Integer" and "TwoDigits" overlap to a certain extent.
The following input
INTEGER 10
PRIO 10
results in
line 2:5 mismatched input '10' expecting TwoDigits
when Integer precedes TwoDigits and in
line 1:8 mismatched input '10' expecting Integer
when TwoDigits precedes Integer in the grammar.
Is there a way around this ?
Thanks - Alex
Edit:
Thanks @GRosenberg, your suggestion of course worked for this small example, but when I integrated it into my full grammar it led to different mismatched input errors.
The reason is another lexer rule which requires a range of '[1-4]', so I thought I'd be clever and turn it into
grammar NOVIANum;
statement : (priorityT | integerT | levelT )* ;
priorityT : T_PRIO twoDigits ;
integerT : T_INTEGER integer ;
levelT : T_LEVEL levelNumber ;
levelNumber : ( ZERO DIGIT ) | ( OneToFour (ZERO | DIGIT) ) ;
integer: ZERO* ( DIGIT ( DIGIT | ZERO )* ) ;
twoDigits : (ZERO | DIGIT) ( ZERO | DIGIT ) ;
oneToFour : OneToFour (DIGIT | ZERO) ;
WS : [ \t\r\n]+ -> skip ;
T_INTEGER : 'INTEGER' ;
T_LEVEL : 'LEVEL' ;
T_PRIO : 'PRIO' ;
DIGIT: OneToFour | FiveToNine ;
ZERO : '0' ;
OneToFour : [1-4] ;
FiveToNine : [5-9] ;
This still works for the previous inputs but ...
INTEGER 350
PRIO 10
LEVEL 01
LEVEL 05
LEVEL 10
LEVEL 49
results in
[#0,0:6='INTEGER',<2>,1:0]
[#1,8:8='3',<5>,1:8]
[#2,9:9='5',<5>,1:9]
[#3,10:10='0',<6>,1:10]
[#4,12:15='PRIO',<4>,2:0]
[#5,17:17='1',<5>,2:5]
[#6,18:18='0',<6>,2:6]
[#7,20:24='LEVEL',<3>,3:0]
[#8,26:26='0',<6>,3:6]
[#9,27:27='1',<5>,3:7]
[#10,29:33='LEVEL',<3>,4:0]
[#11,35:35='0',<6>,4:6]
[#12,36:36='5',<5>,4:7]
[#13,38:42='LEVEL',<3>,5:0]
[#14,44:44='1',<5>,5:6]
[#15,45:45='0',<6>,5:7]
[#16,47:51='LEVEL',<3>,6:0]
[#17,53:53='4',<5>,6:6]
[#18,54:54='9',<5>,6:7]
[#19,55:54='<EOF>',<-1>,6:8]
line 5:6 no viable alternative at input '1'
line 6:6 no viable alternative at input '4'
(statement (integerT INTEGER (integer 3 5 0)) (priorityT PRIO (twoDigits 1 0)) (levelT LEVEL (levelNumber 0 1)) (levelT LEVEL (levelNumber 0 5)) (levelT LEVEL (levelNumber 1 0)) (levelT LEVEL (levelNumber 4 9)))
What am I missing here ?
Edit 2:
Ok, answering my own question here, of course
DIGIT: OneToFour | FiveToNine ;
kicks in where it shouldn't, even in this combined form,
so about the only way to get around this - I can think of - would be
grammar NOVIANum;
statement : (priorityT | integerT | levelT )* ;
priorityT : T_PRIO twoDigits ;
integerT : T_INTEGER integer ;
levelT : T_LEVEL levelNumber ;
levelNumber : ( ZERO (OneToFour | FiveToNine) | ( OneToFour (ZERO | (OneToFour | FiveToNine)) ) ) ;
integer: ZERO* ( (OneToFour | FiveToNine) ( (OneToFour | FiveToNine) | ZERO )* ) ;
twoDigits : (ZERO | (OneToFour | FiveToNine)) ( ZERO | (OneToFour | FiveToNine) ) ;
WS : [ \t\r\n]+ -> skip ;
T_INTEGER : 'INTEGER' ;
T_LEVEL : 'LEVEL' ;
T_PRIO : 'PRIO' ;
// DIGIT: OneToFour | FiveToNine;
ZERO : '0' ;
OneToFour : [1-4] ;
FiveToNine : [5-9] ;
because when I create a parser rule for it like
oneToNine : OneToFour | FiveToNine ;
it'll give me this
(integerT INTEGER (integer (oneToNine 3) (oneToNine 5) 0))
which is ugly and harder to handle than just
(integerT INTEGER (integer 3 5 0))
As a general design principle, always try to handle distinguishing elements and their objects (T_PRIO -> TwoDigits) at the same level, either in the parser or in the lexer. Presuming the semantic nature of the Integer and TwoDigits rules is important, promote them to the parser and let the lexer produce only digits. That is, don't over-constrain the lexer.
In the parser, you can let the integer rule functionally hide the twoDigits rule except in the evaluation of the priorityStatement rule:
priorityStatement : T_PRIO twoDigits ;
integerStatement : T_INTEGER integer ;
integer: ZERO | ( DIGIT ( DIGIT | ZERO )* ) ;
twoDigits : DIGIT DIGIT ;
T_PRIO : 'PRIO' ;
T_INTEGER : 'INTEGER' ;
DIGIT : [1-9] ;
ZERO : '0' ;

remove a line with special character with given pattern

I'm trying to get the lines that contain special characters which are not prefixed with \. Below are the special characters:
^$%.*+?!(){}[]|\
I need to check the 2nd column for any of the above special characters that is not prefixed with \. I'm trying to do this with awk, but no luck. I want the output as below.
input.txt
1,ap^ple
2,o$range
3,bu+tter
4,gr(ape
5,sm\(ok\e
6,ra\in
7,p+la\\y
8,wor\+k
output.txt
1,ap^ple
2,o$range
3,bu+tter
4,gr(ape
5,sm\(ok\e
6,ra\in
7,p+la\\y
The 7th and 5th rows are in output.txt because they contain 2 special characters (one with a backslash, another without).
"final" final edit: I wanted to allow "\x" whatever x is, but the OP seems to not want that, so I fixed it too.
After trying to find a "clever" regexp (which choked on "\\" or any odd number of "\", but apparently worked for the rest...),
I rewrote it in awk as a state machine:
The idea:
In "normal mode", if we encounter a special char other than "\", we print the line.
In "normal mode", if we encounter a "\", we enter "escaped mode", and in that mode we ignore the next char
(but if there is no next char, we need to print that line too!)
the script:
awk -F"," '
{
IN_ESCAPED_MODE=0 ;
for (i=1 ; i<=length($2) ; i++)
{ char=substr($2,i,1)
if ( IN_ESCAPED_MODE == 0)
{ if ( index(".^$%*+?!(){}[]|",char) > 0 )
{ print $0 ; break ;
}
if ( index("\\" , char ) > 0 )
{ IN_ESCAPED_MODE=1 ; continue ;
}
}
if ( IN_ESCAPED_MODE == 1)
{ if ( index(".^$%*+?!(){}[]|\\",char) > 0 )
{ IN_ESCAPED_MODE=0 ; continue ;
}
else
{ IN_ESCAPED_MODE=0 ; print $0; break;
}
}
}
if (IN_ESCAPED_MODE == 1)
{
print $0 ;
}
}
' input.txt > output.txt
With this change, you will have the same output as the OP, which prints a line when it contains "\e" for example... Which I find weird: to me "\e" is fine, we can "escape" anything?
With that input:
1,ap^ple
2,o$range
3,bu+tter
4,gr(ape
5,sm\(ok\e
6,ra\in
7,p+la\\y
8,wor\+k
10,\
11,\\
12,\\\
13,.
14,\.
15,..
16,^
17,\^
18,$
19,\$
20,%
21,\%
22,*
23,\*
24,+
25,\+
26,?
27,\?
28,!
29,\!
30,(
31,\(
32,)
33,\)
34,{
35,\{
36,}
37,\}
38,[
39,\[
40,]
41,\]
42,|
43,\|
it outputs:
1,ap^ple
2,o$range
3,bu+tter
4,gr(ape
5,sm\(ok\e
6,ra\in
7,p+la\\y
10,\
12,\\\
13,.
15,..
16,^
18,$
20,%
22,*
24,+
26,?
28,!
30,(
32,)
34,{
36,}
38,[
40,]
42,|
(so it appears to really work this time !)
If you prefer to allow any "\x" and NOT only if "x" is a SPECIAL char:
change the "middle lines":
if ( IN_ESCAPED_MODE == 1)
{ if ( index(".^$%*+?!(){}[]|\\",char) > 0 )
{ IN_ESCAPED_MODE=0 ; continue ;
}
else
{ IN_ESCAPED_MODE=0 ; print $0; break;
}
}
into:
if ( IN_ESCAPED_MODE == 1)
{ IN_ESCAPED_MODE=0 ; continue ;
}
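The permissive variant (any "\x" counts as escaped) can also be sketched in the shell: delete every backslash together with the character after it, then check whether a special character is left over. This is an illustration rather than a drop-in replacement; the function names are made up, and unlike the OP's expected output it does not print rows 5 and 6, because "\e" and "\i" count as valid escapes here.

```bash
#!/bin/bash
# has_unescaped_special STR: succeed if STR still contains a special character
# after every backslash-escaped pair ("\x", for any x) has been removed.
# Hypothetical helper name.
has_unescaped_special() {
    local stripped
    # Remove each backslash together with the character it escapes.
    stripped=$(printf '%s\n' "$1" | sed 's/\\.//g')
    # Any special character still present (including a lone trailing "\") is unescaped.
    printf '%s\n' "$stripped" | grep -q '[]\\^$%.*+?!(){}[|]'
}

# filter_file FILE: print the lines whose 2nd comma-separated field qualifies.
filter_file() {
    local id field
    while IFS=, read -r id field; do
        if has_unescaped_special "$field"; then
            printf '%s,%s\n' "$id" "$field"
        fi
    done < "$1"
}
```

Usage would be filter_file input.txt > output.txt. Because sed consumes the pairs from left to right, "\\" is removed as one escaped backslash, while the third "\" in "\\\" is left alone and gets reported.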
For historical reasons, here is the regexp (which worked in "most" cases but choked on some, for example if there was a "\\"):
egrep '[^\][].^$%*+?!(){}[|]|[^\][\][^].^$%*+?!(){}[|\]' input.txt > output.txt
But that one will not display the line 12, for example...
A good read: http://www.regular-expressions.info/charclass.html .... and http://www.gnu.org/software/gawk/manual/html_node/Gory-Details.html (scary ...)
You can try the following:
awk '
{
line=$0
gsub(/\\[\^$%.*+?!(){}\[\]|\\]/,"")
if(/[\^$%.*+?!(){}\[\]|\\]/)
print line
}' input.txt
sed '/[]\\^$%.*+?!(){}[|]/ {
h
s/\\[]\\^$%.*+?!(){}[|]/_/g
/[]\\^$%.*+?!(){}[|]/ {
x
p
}
}' YourFile
Depending on the shell and sed, this could be interpreted differently (especially the \). Works on my AIX/KSH.

How to convert string to integer in UNIX shell

I have d1="11" and d2="07". I want to convert d1 and d2 to integers and perform d1-d2. How do I do this in UNIX?
d1 - d2 currently returns "11-07" as result for me.
The standard solution:
expr $d1 - $d2
You can also do:
echo $(( d1 - d2 ))
but beware that this will treat 07 as an octal number! (so 07 is the same as 7, but 010 is different than 10).
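If leading zeros are possible, Bash's base#number notation can be used to force decimal interpretation. A minimal sketch, assuming Bash (not plain POSIX sh) and unsigned digit strings; the helper name sub10 is made up for illustration:

```bash
#!/bin/bash
# sub10 A B: print A - B, reading both operands as base 10 even when they
# have leading zeros. Assumes A and B are unsigned digit strings.
# sub10 is a hypothetical helper name.
sub10() {
    echo $(( 10#$1 - 10#$2 ))
}

d1="11"
d2="07"
sub10 "$d1" "$d2"     # prints 4
sub10 "010" "08"      # prints 2; a plain $(( 010 - 08 )) fails because 08 is not valid octal
```

A signed operand such as -07 is not covered by this sketch; the sign would need to be handled separately, before the 10# prefix is applied.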
Any of these will work from the shell command line. bc is probably your most straightforward solution though.
Using bc:
$ echo "$d1 - $d2" | bc
Using awk:
$ echo $d1 $d2 | awk '{print $1 - $2}'
Using perl:
$ perl -E "say $d1 - $d2"
Using Python 2:
$ python -c "print $d1 - $d2"
all return
4
An answer that is not limited to the OP's case
The title of the question leads people here, so I decided to answer that question for everyone else since the OP's described case was so limited.
TL;DR
I finally settled on writing a function.
If you want 0 in case of non-int:
int(){ printf '%d' ${1:-} 2>/dev/null || :; }
If you want [empty_string] in case of non-int:
int(){ expr 0 + ${1:-} 2>/dev/null||:; }
If you want to find the first int or [empty_string]:
int(){ expr ${1:-} : '[^0-9]*\([0-9]*\)' 2>/dev/null||:; }
If you want to find the first int or 0:
# This is a combination of numbers 1 and 3
int(){ printf '%d' $(expr ${1:-} : '[^0-9]*\([0-9]*\)' 2>/dev/null)||:; }
If you want to get a non-zero status code on non-int, remove the ||: (aka or true) but leave the ;
Tests
# Wrapped in parens to call a subprocess and not `set` options in the main bash process
# In other words, you can literally copy-paste this code block into your shell to test
( set -eu;
tests=( 4 "5" "6foo" "bar7" "foo8.9bar" "baz" " " "" )
test(){ echo; type int; for test in "${tests[@]}"; do echo "got '$(int $test)' from '$test'"; done; echo "got '$(int)' with no argument"; }
int(){ printf '%d' ${1:-} 2>/dev/null||:; };
test
int(){ expr 0 + ${1:-} 2>/dev/null||:; }
test
int(){ expr ${1:-} : '[^0-9]*\([0-9]*\)' 2>/dev/null||:; }
test
int(){ printf '%d' $(expr ${1:-} : '[^0-9]*\([0-9]*\)' 2>/dev/null)||:; }
test
# unexpected inconsistent results from `bc`
int(){ bc<<<"${1:-}" 2>/dev/null||:; }
test
)
Test output
int is a function
int ()
{
printf '%d' ${1:-} 2> /dev/null || :
}
got '4' from '4'
got '5' from '5'
got '0' from '6foo'
got '0' from 'bar7'
got '0' from 'foo8.9bar'
got '0' from 'baz'
got '0' from ' '
got '0' from ''
got '0' with no argument
int is a function
int ()
{
expr 0 + ${1:-} 2> /dev/null || :
}
got '4' from '4'
got '5' from '5'
got '' from '6foo'
got '' from 'bar7'
got '' from 'foo8.9bar'
got '' from 'baz'
got '' from ' '
got '' from ''
got '' with no argument
int is a function
int ()
{
expr ${1:-} : '[^0-9]*\([0-9]*\)' 2> /dev/null || :
}
got '4' from '4'
got '5' from '5'
got '6' from '6foo'
got '7' from 'bar7'
got '8' from 'foo8.9bar'
got '' from 'baz'
got '' from ' '
got '' from ''
got '' with no argument
int is a function
int ()
{
printf '%d' $(expr ${1:-} : '[^0-9]*\([0-9]*\)' 2>/dev/null) || :
}
got '4' from '4'
got '5' from '5'
got '6' from '6foo'
got '7' from 'bar7'
got '8' from 'foo8.9bar'
got '0' from 'baz'
got '0' from ' '
got '0' from ''
got '0' with no argument
int is a function
int ()
{
bc <<< "${1:-}" 2> /dev/null || :
}
got '4' from '4'
got '5' from '5'
got '' from '6foo'
got '0' from 'bar7'
got '' from 'foo8.9bar'
got '0' from 'baz'
got '' from ' '
got '' from ''
got '' with no argument
Note
I got sent down this rabbit hole because the accepted answer is not compatible with set -o nounset (aka set -u)
# This works
$ ( number="3"; string="foo"; echo $((number)) $((string)); )
3 0
# This doesn't
$ ( set -u; number="3"; string="foo"; echo $((number)) $((string)); )
-bash: foo: unbound variable
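One way to avoid that failure is to validate the string before it ever reaches arithmetic expansion. The following is a sketch only, assuming Bash (it relies on [[ =~ ]] and the 10# prefix); to_int is a made-up name, and values too large for the shell's integer type are not handled.

```bash
#!/bin/bash
# to_int STR: print STR as a decimal integer, or return status 1 if STR is
# not an optionally signed run of digits. Hypothetical helper, Bash-only.
to_int() {
    local s=${1:-} sign=""
    # Reject anything that is not [+-]digits, so no variable name is ever evaluated.
    [[ $s =~ ^[+-]?[0-9]+$ ]] || return 1
    case $s in
        -*) sign="-"; s=${s#-} ;;
        +*) s=${s#+} ;;
    esac
    # 10# forces base 10, so "07" and "010" are not read as octal.
    echo $(( ${sign}1 * 10#$s ))
}
```

With set -u active, to_int 07 prints 7, while to_int foo prints nothing and returns 1 instead of aborting with "unbound variable"; append || : if you want to ignore the status.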
let d=d1-d2;echo $d;
This should help.
Use this:
#include <stdio.h>
#include <stdlib.h>
int main()
{
const char *d1 = "11";
int d1int = atoi(d1); /* convert the string to an int */
printf("d1 = %d\n", d1int); /* print the converted value, not the pointer */
return 0;
}
etc.

Resources