context sensitive lexing & parsing - antlr4

I have a set of files to parse with rather unusual contents. Each line contains the following data (everything else, which is not relevant here, has been removed):
DATA <alphabetic data> <numeric data><period as terminator>
This may be followed by text such as:
TEXT <alphanumeric data + punctuation (including period)>
I am having a problem parsing the DATA line, because the numeric data can take any of the following forms:
99.9
99.
.9
It is especially hard to parse data like:
DATA ANACRON99..
The first dot at the end is the decimal point and the second is the terminator.
A sample of the grammar I tried (only the relevant portion) is as follows:
file: lines+ EOF
;
lines: data_line
| text_line
;
text_line: TEXT TEXTUALDATA
;
data_line: DATA sensordata
;
sensordata: DATA FLOATVALUE PERIOD
;
TEXT:'TEXT';
DATA: 'DATA' ->mode(SENSORMODE);
TEXTUALDATA: (.)*?
;
mode SENSORMODE;
FLOATVALUE: ([0-9])*('.')([0-9])*
;
WS:[ \t]->skip
;
WS2:[\r\n]
;
PERIOD:'.' ->mode(DEFAULT_MODE)
;
This detects the first period as part of the float data but completely ignores the second, and complains that it was expecting PERIOD but found EOF. What would be a way to solve this? Is there any way to look ahead and at the same time keep track of the last token detected?
Thanks!!

FLOATVALUE can match a single period on its own, and it is listed before PERIOD. So for 99.. the lexer matches two FLOATVALUE tokens in a row (99. and then .) and never produces a PERIOD.
To avoid the ambiguity, require at least one digit in FLOATVALUE:
FLOATVALUE: ([0-9])+('.')([0-9])+
| ([0-9])+('.')
| ('.')([0-9])+
;

To keep FLOATVALUE from matching your PERIOD, you can switch the order of your lexer rules:
mode SENSORMODE;
PERIOD:'.' ->mode(DEFAULT_MODE)
;
FLOATVALUE: ([0-9])*('.')([0-9])*
;
WS:[ \t]->skip
;
WS2:[\r\n]
;
ANTLR always returns the token type of the longest matching lexer rule; when several rules match the same length, it returns the one defined first. The modified grammar moves PERIOD to the first position, so a lone period becomes PERIOD rather than FLOATVALUE.
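The tie-break rule described above can be illustrated outside ANTLR with a small Python sketch. This is only a toy model, not ANTLR's implementation: the `tokenize` function and the three rule lists are hypothetical names introduced here, and Python's `re` is used on the assumption that, for these particular patterns, its greedy match is also the longest one.

```python
import re

def tokenize(text, rules):
    """Split text into (rule_name, lexeme) pairs.

    rules is an ordered list of (name, regex). At each position the longest
    match wins; when two rules match the same length, the earlier rule wins.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Strict ">" keeps the earlier rule when the lengths are equal.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Rule order as in the question: FLOATVALUE may be a bare period.
ORIGINAL = [("FLOATVALUE", r"[0-9]*\.[0-9]*"), ("PERIOD", r"\.")]
# Same rules with PERIOD moved first.
REORDERED = [("PERIOD", r"\."), ("FLOATVALUE", r"[0-9]*\.[0-9]*")]
# FLOATVALUE requiring at least one digit.
TIGHTENED = [("FLOATVALUE", r"[0-9]+\.[0-9]+|[0-9]+\.|\.[0-9]+"), ("PERIOD", r"\.")]

for label, rules in [("original", ORIGINAL), ("reordered", REORDERED), ("tightened", TIGHTENED)]:
    print(label, tokenize("99..", rules))
```

With the original order, the second period comes out as another FLOATVALUE; with either fix it comes out as PERIOD, which is what the parser rule expects.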

Related

Inconsistent behaviour concatenating macro variables

I am trying to create a string by concatenating several variables/delimiters within a macro:
%macro write_to_string();
%let delim = = ;
%let string = %sysfunc(catx(%str( ),
&string, \,
step start,
%nrstr(%superq(delim)),
&etls_stepStartTime,
|,
output table,
%nrstr(%superq(delim)),
&SYSLAST,
|,
transform return code,
%nrstr(%superq(delim)),
&trans_rc));
%mend;
The macro is called at the end of several transformations (within SAS DI), so the string keeps having text appended at the end.
If each instance of %nrstr(%superq(delim)) is replaced with some other delimiter, say :, then the above macro behaves as intended. But with the code as above, what I get is a 0 followed by the last string to have been appended.
I am quite ignorant about macro variables and functions and am struggling to understand:
Why the choice of delimiter seems to affect whether the string is properly appended
Why macro variables sometimes need to be referenced with the preceding & and sometimes not.
Any help is greatly appreciated!
EDIT
The input variables in the code above are autogenerated by the SAS DI system and reset after each transformation in the job. The values look something like
&etls_stepStartTime = 16FEB2017:17:25:37
&SYSLAST = WORK.MY_TABLE_NAME
&trans_rc = 0
Here the value of &trans_rc will indicate the error/warning status of the last transformation that ran.
So my desired output (with the &delim variable working) would be values of the form
step start = 16FEB2017:17:25:37 | output table = WORK.MY_TABLE_NAME | transform return code = 0
delimited by \. As mentioned above, what I get is only the last value (the one corresponding to the last transformation) with a preceding 0\, unless I change the delimiter to some non-reserved character constant.

Don't use %SYSFUNC() with the CAT... series of functions. First, you don't need them: in macro code you can just place the text where you want it. Second, those functions can work on either numeric or character arguments, which means SAS has to try to figure out whether the text your macro code generates as the arguments represents a number or a character string. That is probably why the equal signs result in zeros: SAS is treating the equal sign as an equality test, so the zero means that the values on each side are not equal.
%let string =&string \ step start &delim &etls_stepStartTime ;
%let string =&string | output table &delim &SYSLAST ;
%let string =&string | transform return code &delim &trans_rc ;

Hyphen with strings in PROC FORMAT

I am working with ICD-9 codes and am creating a mapping between codes and an integer:
proc format library = &formatlib;
invalue category other = 0
'410'-'410.99', '425.4'-'425.99' = 1
I have searched and searched but haven't been able to find an explanation of how that range actually works when it comes to formatting.
Take the first range, for example. I assume SAS interprets '410'-'410.99' as "take every value in the inclusive range [410, 410.99] and convert it to a 1". Please correct me if I'm wrong in that assumption. Does SAS treat these apparent strings as floating-point decimals, then? I think that must be the case if these are to be numerical ranges for formatting all codes within the range.
I'm coming to SAS from the worlds of R and Python, so the way quote characters are used in SAS is sometimes unclear to me (for example, in %let foo = bar no quotes are used).

When SAS compares string values with normal comparison operators, what it does is compare the byte representation of each character in the string, one at a time, until it reaches a difference.
So when a string is read with this informat, it is compared to the 'start' string of each range and, if it is greater than or equal to start, it is then compared to the 'end' string; if it is also less than or equal to end, the result is 1. If that fails for every range listed, the other clause applies and the result is 0.
Importantly, this means that some nonsensical results can occur. See the last row of the following test, for example.
proc format;
invalue category other = 0
'410'-'410.99', '425.4'-'425.99' = 1
;
quit;
data test;
input #1 testval $6.;
category=input(testval,category.);
datalines;
425.23
425.45
425.40
410#
410.00
410.AA
410.7A
;;;;
run;
410.7A is compared to 410 and found greater: '4'='4', '1'='1', '0'='0', and then '.' > ' ' (the start value is padded with spaces). Then 410.7A is compared to 410.99 and found less: '4'='4', '1'='1', '0'='0', '.'='.', and '7' < '9'. The trailing A never takes part in the comparison. The row above it (410.AA), by contrast, is not in the range, since 'A' is ASCII 41x and that is not less than '9' (ASCII 39x).
Note that all SAS strings are padded to their full length with spaces. This can be important in string comparisons, because space is the lowest-valued printable character (if you consider space printable). Thus any character you're likely to compare to space will be higher. For example, the fourth row (410#) is a 1 because # falls between space and . in the ASCII table. Change it to / and it fails, because / sorts after the period. Similarly, change it to byte(13) (through code) and it fails, because it is then less than space (so 410^M, with ^M representing byte(13), is less than the start value 410). In informats and formats, SAS treats the start and end values as having whatever length it needs: if you're reading a 6-character string, it treats them as length 6 and pads the rest with spaces.
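For readers coming from Python, the comparison described above can be modelled in a few lines. This is a rough sketch of the behaviour as explained in this answer, not of SAS internals: the `category` function, the `RANGES` list and the fixed width of 6 are assumptions made for illustration, and Python's string comparison stands in for the byte-wise comparison because the test values are plain ASCII.

```python
# Ranges from the informat: each (start, end) pair maps to 1, everything else to 0.
RANGES = [("410", "410.99"), ("425.4", "425.99")]

def category(value, width=6):
    """Return 1 if value falls in any range under space-padded string comparison."""
    padded = value.ljust(width)  # pad with spaces, as SAS does for character values
    for start, end in RANGES:
        # Both bounds are padded to the same width and compared character by character.
        if start.ljust(width) <= padded <= end.ljust(width):
            return 1
    return 0

for testval in ["425.23", "425.45", "425.40", "410#", "410.00", "410.AA", "410.7A"]:
    print(testval, category(testval))
```

Running it on the values from the test data set shows the same pattern the answer walks through: 410.7A and 410# land in the first range, while 410.AA and 425.23 do not.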

SAS finding an uppercase word within a string

I have a string which contains one word in uppercase somewhere within it. I want to extract that one word into a new variable using SAS.
I think I need to find a way to code up finding a word which contains two or more uppercase letters (as the start of a sentence would begin with an uppercase letter).
i.e. How do I create the variable 'word':
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
Hope someone can help and the question is clear
Thanks in advance

Consider a regex match/replace with a negative lookahead that covers two types of matches:
runs of uppercase letters and spaces at least two characters long (the minimum length avoids the capitalised first letter of a sentence): (([A-Z ]){2,})
runs of uppercase letters and periods at least two characters long (again to avoid the capitalised first letter of a sentence): (([A-Z.]){2,})
CAVEAT: This solution works, except that the pronoun I is also matched, which is technically a valid match since it is an all-uppercase one-letter word. As it is the only such case in these examples, consider a tranwrd() replace for it. Relatedly, note that this solution matches ALL uppercase words in the string, not just one.
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
data example;
set example;
pattern_num = prxparse("s/(?!(([A-Z ]){2,})|(([A-Z.]){2,})).//");
wordextract = prxchange(pattern_num, -1, txtString);
wordextract = tranwrd(wordextract, " I ", "");
drop pattern_num;
run;
txtString                             word     wordextract
This is one EXAMPLE. Of what I need.  EXAMPLE  EXAMPLE
THIS is another.                      THIS     THIS
etc ETC                               ETC      ETC

SAS has a prxsubstr() function call that finds the starting position and length of a substring that matches a given regex pattern within a given string. Here's a sample solution using the prxsubstr() function call:
data solution;
set example;
/* Build a regex pattern of the word to search for, and hang on to it */
/* (The regex below means: word boundary, then two or more capital letters,
then word boundary. Word boundary here means the start or the end of a string
of letters, digits and/or underscores.) */
if _N_ = 1 then pattern_num = prxparse("/\b[A-Z]{2,}\b/");
retain pattern_num;
/* Get the starting position and the length of the word to extract */
call prxsubstr(pattern_num, txtString, mypos, mylength);
/* If a word matching the regex pattern is found, extract it */
if mypos ^= 0 then word = substr(txtString, mypos, mylength);
run;
SAS prxsubstr() documentation: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002295971.htm
Regex word boundary info: http://www.regular-expressions.info/wordboundaries.html
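The same word-boundary pattern can be tried out quickly in Python before committing it to a SAS data step. This is only a sketch for experimenting with the regex: `first_upper_word` is a hypothetical helper name, and it returns an empty string when nothing matches, which is a choice made here rather than anything SAS does.

```python
import re

# Word boundary, two or more capital letters, word boundary.
UPPER_WORD = re.compile(r"\b[A-Z]{2,}\b")

def first_upper_word(text):
    """Return the first all-uppercase word of length >= 2, or "" if there is none."""
    match = UPPER_WORD.search(text)
    return match.group() if match else ""

for sample in ["This is one EXAMPLE. Of what I need.", "THIS is another.", "etc ETC"]:
    print(repr(sample), "->", first_upper_word(sample))
```

Because the pattern requires at least two capitals between word boundaries, the single-letter word I and the capitalised first letter of a sentence are both skipped.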

Perl Morgan and a String?

I am trying to solve this problem on hackerrank:
So the problem is:
Jack and Daniel are friends. Both of them like letters, especially upper-case ones.
They are cutting upper-case letters from newspapers, and each one of them has their collection of letters stored in separate stacks.
One beautiful day, Morgan visited Jack and Daniel. He saw their collections. Morgan wondered what is the lexicographically minimal string, made of that two collections. He can take a letter from a collection when it is on the top of the stack.
Also, Morgan wants to use all the letters in the boys' collections.
This is my attempt in Perl:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $n=<>);
while($n>0){
chomp(my $string1=<>);
chomp(my $string2=<>);
lexi($string1,$string2);
$n--;
}
sub lexi{
    my($str1,$str2)=@_;
    my @str1=split(//,$str1);
    my @str2=split(//,$str2);
    my $final_string="";
    while(@str2 && @str1){
        my $st2=$str2[0];
        my $st1=$str1[0];
        if($st1 le $st2){
            $final_string.=$st1;
            shift @str1;
        }
        else{
            $final_string.=$st2;
            shift @str2;
        }
    }
    if(@str1){
        $final_string=$final_string.join('',@str1);
    }
    else{
        $final_string=$final_string.join('',@str2);
    }
    print $final_string,"\n";
}
Sample Input:
2
JACK
DANIEL
ABACABA
ABACABA
The first line contains the number of test cases, T.
Every next two lines have such format: the first line contains string A, and the second line contains string B.
Sample Output:
DAJACKNIEL
AABABACABACABA
For the sample test case it gives the right results, but it gives wrong results for other test cases. One case for which it gives an incorrect result is
1
AABAC
AACAB
It outputs AAAABACCAB instead of AAAABACABC.
I don't know what is wrong with the algorithm or why it fails on other test cases.
Update:
As per @squeamishossifrage's comments, if I add
($str1,$str2)=sort{$a cmp $b}($str1,$str2);
the result becomes the same irrespective of the order of the inputs, but the test case still fails.

The problem is in your handling of the equal characters. Take the following example:
ACBA
BCAB
When faced with two identical characters (C in my example), you naïvely chose the one from the first string, but that's not always correct. You need to look ahead to break ties. You may even need to look many characters ahead. In this case, next character after C of the second string is lower than the next character of the first string, so you should take the C from the second string first.
By leaving the strings as strings, a simple string comparison will compare as many characters as needed to determine which character to consume.
sub lexi {
    my ($str1, $str2) = @_;
    utf8::downgrade($str1); # Makes sure length() will be fast
    utf8::downgrade($str2); # since we only have ASCII letters.
    my $final_string = "";
    while (length($str2) && length($str1)) {
        $final_string .= substr($str1 le $str2 ? $str1 : $str2, 0, 1, '');
    }
    $final_string .= $str1;
    $final_string .= $str2;
    print $final_string, "\n";
}
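The look-ahead idea can also be sketched in Python by comparing the remaining suffixes of the two stacks at every step. One detail below is an addition of this sketch and not part of the answer above: a sentinel character that sorts after every uppercase letter is appended to both strings, so that when one remainder is a prefix of the other (for example B and BA) the longer one is preferred. The names `morgan` and `SENTINEL` are hypothetical, and the repeated slicing makes this quadratic, so treat it as an illustration rather than a tuned solution.

```python
SENTINEL = "z"  # assumption: inputs contain only 'A'-'Z', and 'z' sorts after all of them

def morgan(a, b):
    """Merge two stacks of letters into the lexicographically smallest string."""
    a += SENTINEL
    b += SENTINEL
    i = j = 0
    out = []
    # Stop as soon as either stack has only its sentinel left.
    while i < len(a) - 1 and j < len(b) - 1:
        # Comparing whole remainders looks ahead as far as needed to break ties.
        if a[i:] <= b[j:]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # Append whatever is left of either stack, dropping the sentinel.
    out.append(a[i:-1])
    out.append(b[j:-1])
    return "".join(out)

print(morgan("JACK", "DANIEL"))  # DAJACKNIEL
print(morgan("AABAC", "AACAB"))  # AAAABACABC
```

The second call is the failing case from the question; comparing remainders instead of single characters is what makes the C come from the second string first.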

Too little rep to comment thus the answer:
What you need to do is to look ahead if the two characters match. You currently do a simple le match and in the case of
ZABB
ZAAA
You'll get ZABBZAAA, since in the first comparison Z le Z is true and the Z is taken from the first string. So what you need to do (a naive solution which most likely won't be very efficient) is to keep looking as long as the strings/chars match, so:
Z eq Z
ZA eq ZA
ZAB gt ZAA
and at that point you will know that the second string is the one you want to pop from for the first character.
Edit
You updated with sorting the strings, but like I wrote you still need to look ahead. The sorting will solve the two above strings but will fail with these two:
ZABAZA
ZAAAZB
ZAAAZBZABAZA
Because here the correct answer is ZAAAZABAZAZB, and you can't find that by simply comparing character by character.

Fortran read of data with * to signify similar data

My data looks like this
-3442.77 -16749.64 893.08 -3442.77 -16749.64 1487.35 -3231.45 -16622.36 902.29
.....
159*2539.87 10*0.00 162*2539.87 10*0.00
which means I start with either 7 or 8 reals per line and then (towards the end) have 159 values of 2539.87 followed by 10 values of 0 followed by 162 of 2539.87 etc. This seems to be a space-saving method as previous versions of this file format were regular 6 reals per line.
I am already reading the data into a string because of not knowing whether there are 7 or 8 numbers per line. I can therefore easily spot lines that contain *. But what then? I suppose I have to identify the location of each * and then identify the integer number before and real value after before assigning to an array. Am I missing anything?

Read the line. Split it into tokens delimited by whitespace. In tokens that contain a *, replace it with a space. Then read one or two values from the token, depending on whether there was an asterisk or not. Sample code follows:
REAL, DIMENSION(big) :: data
CHARACTER(LEN=40) :: token
INTEGER :: iptr, count, idx
REAL :: val
iptr = 1
DO WHILE (there_are_tokens_left)
... ! Get the next token into "token"
idx = INDEX(token, "*")
IF (idx == 0) THEN
READ(token, *) val
count = 1
ELSE
! Replace "*" with space and read two values from the string
token(idx:idx) = " "
READ(token, *) count, val
END IF
data(iptr:iptr+count-1) = val ! Add "val" "count" times to the list of values
iptr = iptr + count
END DO
Here I have arbitrarily set the length of the token to be 40 characters. Adjust it according to what you expect to find in your input files.
BTW, for the sake of completeness, this method of compressing something by replacing repeating values with value/repetition-count pairs is called run-length encoding (RLE).
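To make the token-by-token expansion concrete, here is the same idea as a short Python sketch. It is meant only to show the logic in a form that is easy to test; `expand_line` is a hypothetical name, and the sketch assumes every token is either a plain real or a count*value pair with an integer count.

```python
def expand_line(line):
    """Expand run-length encoded tokens such as '159*2539.87' into a flat list of floats."""
    values = []
    for token in line.split():  # tokens are separated by whitespace
        if "*" in token:
            # "count*value": repeat the value count times
            count_text, value_text = token.split("*", 1)
            values.extend([float(value_text)] * int(count_text))
        else:
            # a plain value counts once
            values.append(float(token))
    return values

expanded = expand_line("159*2539.87 10*0.00 162*2539.87 10*0.00")
print(len(expanded))
```

Lines without any asterisk go through the same function unchanged, so the caller does not need to treat the two kinds of line differently.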

Your input data may have been written in a form suitable for list-directed input (where the format specification in the READ statement is simply *). List-directed input supports the r*c form that you see, where r is a repeat count and c is the constant to be repeated.
If the total number of input items is known in advance (perhaps it is fixed for that program, perhaps it is defined by earlier entries in the file) then reading the file is as simple as:
REAL :: data(size_of_data)
READ (unit, *) data
For example, for the last line shown in your example on its own, size_of_data would need to be 341, from 159+10+162+10.
With list-directed input the data can span multiple records (multiple lines). You don't need to know in advance how many items are on each line, just how many appear in the next "block" of data.
List-directed input has a few other "features" like this, which is why it is generally not a good idea to use it to parse "arbitrary" input that hasn't been written with it in mind. Use an explicit format specification instead (which may require creating the format specification on the fly to match the width of the input field if that is not known ahead of time).
If you don't know (or cannot calculate) the number of items in advance of the READ statement then you will need to do the parsing of the line yourself.
