I need to extract information from a COBOL program using the ANTLR grammar for COBOL. Specifically, I need to extract group variables as a whole, but I'm not able to do this with ANTLR, as the parser extracts every variable subdivision/group item as an individual element.
I somehow need to get the group items as one unit. I'm new to COBOL, so I want to understand how the compiler decides which elements to include in a group, and where to stop.
EX:
01 EMPREC.
02 EEMPNAME.
10 FIRSTNAME PIC X(10)
10 LASTNAM PIC X(15)
07 SNO PIC X(15)
Is the above definition valid? Will the compiler include all elements (levels >= 02 and <= 49) after the first item (01 EMPREC) in the group EMPREC until it encounters another 01 or 77? Is this safe to assume?
Is the level information enough to derive what elements fall under a group?
Any pointers are appreciated.
I am the author of the COBOL ANTLR4 grammar you found in the ANTLR4 grammars project. The COBOL grammar generates only an Abstract Syntax Tree (AST).
In contrast, what you ask for is an Abstract Semantic Graph (ASG), which represents grouping of variables and in general relationships between AST elements.
Such an ASG is generated by the COBOL parser at my proleap-cobol-parser project. This project uses the mentioned COBOL grammar and resolves relationships between AST elements.
An example for parsing data description entries can be found in this unit test.
You actually had two questions:
"Is the [...] definition valid?" No it is not as you have no previous level 07. If you change the level of EEMPNAME to 07 or SNO to 02 it is valid. Group items may have a USAGE clause but no PICTURE.
This leads to the question "I want to get an understanding of how the compiler understands which elements to include in a group, and where to stop".
You need to store the level number together with the variable. If you want to know what is part of a group, you need to collect the following entries with higher level numbers and stop at the first entry with the same or a lower level number. For example, to get a complete level 02 group, take all following variables with a higher level number until you reach the next level 02 or a more senior level (in this case 01).
Depending on your needs, you additionally have to check whether the next variable with the same level number has a REDEFINES clause; in that case it belongs to the same group (storage-wise). Something similar applies to level 66 (RENAMES), which doesn't have its own storage.
Level 88 has no storage either; it only defines condition names (validation entries). Depending on the parsing you want to do, you can ignore them.
Important: level 88 does not create a sub-item; you can have multiple of them, and a lower level number may follow afterwards.
The level numbers that always define a new item are 01 and, with extensions, 66, 77, and 78.
01 vargroup.
   02 var-1 pic 9.
      88 var-is-even values 0 2 4 6 8.
      88 var-is-not-even values 1 3 5 7 9.
      88 var-is-big value 6 thru 9.
   02 var-2 pic x.
01 new-var pic x.
77 other-var pic 9.
I suggest reading some COBOL sources, for example CBL_OC_DUMP, and coming up with a new question if necessary.
I suspect you are going to need to put some additional code behind your ANTLR parser. If you tokenize each individual item, then keeping up with a stack of group items is somewhat easy. However, trying to grab the entire group item as a single production will be very hard.
Some of the challenges that ANTLR will not be up to are: 1) group items can contain group items; 2) group items can redefine other items, or be redefined; 3) the little-used, but very complicating, level-66 RENAMES clause.
If you treat each numbered data definition as a separate production and maintain a stack (pushing for new items, popping once you have completed processing an item, and knowing that you have completed a group once you see the same or a lower level number again), your life will be easier.
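A minimal sketch of that stack approach in Java (a natural choice, since ANTLR parsers are commonly generated as Java; Item, GroupBuilder, and onEntry are made-up names, not part of ANTLR or any COBOL parser API):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// One node per data description entry; children hold subordinate items.
final class Item {
    final int level;
    final String name;
    final List<Item> children = new ArrayList<>();
    Item(int level, String name) { this.level = level; this.name = name; }
}

final class GroupBuilder {
    private final Item root = new Item(0, "<root>");   // artificial top
    private final Deque<Item> stack = new ArrayDeque<>();
    GroupBuilder() { stack.push(root); }

    // Call once per data description entry, in source order, e.g. from an
    // ANTLR listener. Levels 66, 77, 88 and REDEFINES still need the
    // special-casing described in the answers above.
    void onEntry(int level, String name) {
        // A new entry closes every open item at the same or a deeper level.
        while (stack.peek().level >= level) {
            stack.pop();
        }
        Item item = new Item(level, name);
        stack.peek().children.add(item);
        stack.push(item);
    }

    Item tree() { return root; }   // root's children are the 01/77 items
}

Feeding it (1, EMPREC), (2, EEMPNAME), (10, FIRSTNAME), (10, LASTNAM) nests FIRSTNAME and LASTNAM under EEMPNAME under EMPREC; a following level 01 or 77 entry closes the whole group.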
It has been quite a while since I've done COBOL, but there are quite a few issues, if my memory serves me correctly.
1) 01 levels always start in column 8.
2) When assigning subsequent levels you are better off incrementing by 5:
01 my-record.
   05 my-name pic x(30) value spaces.
   05 my-address1 pic x(40) value spaces.
3) 77 levels, I think, are now obsolete, since they are not an efficient use of memory. Also, when 77 levels are used they should always be defined at the start of the WORKING-STORAGE SECTION. Obviously record layouts are defined in the FILE SECTION, unless you are using WRITE ... FROM and READ ... INTO?
4) If you are defining lots of items like new-var pic x, don't use a new 01 level for each!
01 ws-flags.
   05 ws-flag1 pic x value space.
   05 ws-flag2 pic x value space.
etc.
For COBOL manuals try Stern & Stern.
Hope this helps!
Related
In BASIC, line numbers are in increments of 10. For example, mandlebrot.bas from github/linguist:
10 REM Mandelbrot Set with ANSI Colors in BASIC
20 REM https://github.com/telnet23
30 REM 20 November 2020
40 CLS
50 MAXK = 32
60 MINRE = -2.5
70 MAXRE = 1.5
80 MINIM = -1.5
90 MAXIM = 1.5
100 FOR X = 1 TO WIDTH
110 FOR Y = 1 TO HEIGHT
120 LOCATE Y, X
130 REC = MINRE + (MAXRE - MINRE) / (WIDTH - 1) * (X - 1)
140 IMC = MINIM + (MAXIM - MINIM) / (HEIGHT - 1) * (Y - 1)
150 K = 0
160 REF = 0
170 IMF = 0
180 K = K + 1
190 REF = REC + REF * REF - IMF * IMF
200 IMF = IMC + REF * IMF + REF * IMF
210 IF REF * REF + IMF * IMF > 4 THEN GOTO 230
220 IF K < MAXK THEN GOTO 180
230 M = 40 + INT(8 / MAXK * (K - 1))
240 PRINT CHR$(27) + "[" + STR$(M) + "m";
250 PRINT " ";
260 PRINT CHR$(27) + "[49m";
270 NEXT Y
280 NEXT X
Why isn't it just increments of 1? That would make more sense.
The short answer is that BASIC numbering is in increments of one, but programmers can and do skip some of the increments. BASIC grew out of Fortran, which also used numeric labels, and often used increments of 10. Unlike Fortran, early BASIC required numbering all lines, so that they changed from labels to line numbers.
BASIC is numbered in increments greater than one to allow adding new lines between existing lines.
Most early home computer BASIC implementations did not have a built-in means of renumbering lines.
Code execution in BASIC implementations with line numbers happened in order of line number.
This meant that if you wanted to add new lines, you needed to leave numbers free between those lines. Even on computers with a RENUM implementation, renumbering could take time. So if you wanted standard increments you’d still usually only RENUM at the end of a session or when you thought you were mostly finished.
Speculation: Programmers use increments of 10 specifically for BASIC line numbers for at least two reasons. First, tradition. Fortran code from the era appears to use increments of 10 for its labels when it uses any standard increments at all. Second, appearance. On the smaller screens of the era it is easier to see where BASIC lines start if they all end in the same symbol, and zero is a very useful symbol for that purpose. Speaking from personal experience, I followed the spotty tradition of starting different routines on hundreds boundaries and thousands boundaries to take advantage of the multiple zeroes at the beginning of the line. This made it easier to recognize the starts of those routines later when reading through the code.
BASIC grew from Fortran, which also used numbers, but as labels. Fortran lines only required a label if they needed to be referred to, such as with a GO TO, to know where a loop can be exited, or as a FORMAT for a WRITE. Such lines were also often in increments greater than 1—and commonly also 10—so as to allow space to add more in between if necessary. This wasn’t technically necessary. Since they were labels and not line numbers, they didn’t need to be sequential. But most programmers made them sequential for readability.
In his commonly-used Fortran 77 tutorial, Erik Boman writes:
Typically, there will be many loops and other statements in a single program that require a statement label. The programmer is responsible for assigning a unique number to each label in each program (or subprogram). The numerical value of statement labels have no significance, so any integer numbers can be used. Typically, most programmers increment labels by 10 at a time.
BASIC required that all lines have numbers and that the line numbers be sequential; that was part of the purpose of having line numbers: a BASIC program could be entered out of order. This allowed for later edits. Thus, line 15 could be added after lines 10 and 20 had been added. This made leaving potential line numbers between existing line numbers even more useful.
If you look at magazines with BASIC program listings, such as Rainbow Magazine or Creative Computing, you’ll often see numbers sandwiched somewhat randomly between the tens. And depending on style, many people used one less than the line number at the start of a routine or subroutine to comment the routine. Routines and DATA sections might also start on even hundreds or even thousands.
Programmers who used conventions like this might not even want to renumber a program, as it would mess up their conventions. BASIC programs were often a mass of text; any convention that improved readability was savored.
Ten was a generally accepted spacing even before the home computer era. In his Basic BASIC, second edition (1978, and expecting that the user would be using “a remote terminal”), James S. Coan writes (page 2):
It is conventional although not required to use intervals of 10 for the numbers of adjacent lines in a program. This is because any modification in the program must also have line numbers. So you can use the in-between numbers for that purpose. It should be comforting to know at this point that the line numbers do not have to be typed in order. No matter what order they are typed in, the computer will follow the numerical order in executing the program.
There are examples of similar patterns in Coan’s Basic Fortran. For example, page 46 has a simple program to “search for pythagorean triples”; while the first label is 12, the remaining labels are 20, 30, and 40, respectively.
He used similar patterns without increments of 10; for example, on page 132 of Basic Fortran, Coan uses increments of 2 for his labels, and keeps the calculation section of the program in the hundreds with the display section of the program in the two hundreds. The END statement uses label 9900.
Similarly, in their 1982 Elementary BASIC, Henry Ledgard and Andrew Singer write (page 27):
Depending on the version of Basic you are using, a line number can consist of 1 to 4 or 5 digits. Here, all line numbers will consist of 4 digits, a common practice accepted by almost every version of Basic. The line numbers must be in sequential order. Increasing line numbers are often given in increments of 10, a convention we will also follow. This convention allows you to make small changes to a program without changing all the line numbers.
And Jerald R. Brown’s 1982 Instant BASIC: 2nd Astounding Edition (p. 7):
You don’t have to enter or type in a program in line number order. That is, you don’t have to enter line 10 first, then line 20, and then line 30. If we type in a program out of line number order, the computer doesn’t care. It follows the line numbers not the order they were entered or typed in. This makes it easy to insert more statements in a program already stored in the computer’s memory. You may have noticed how we cleverly number the statements in our programs by 10's. This makes it easy to add more statements between the existing line numbers -- up to nine more statements between lines 10 and 20, for example.
Much of the choice of how to number lines in a BASIC program was based on tradition and a vague sense of what worked. This was especially true in the home computer era where most users didn’t take classes on how to use BASIC but rather learned by reading other people’s programs, typing them in from the many books and magazines that provided program listings. The tradition of incrementing by 10 and inserting new features between those increments was an obvious one.
You can see it scanning through old books of code, such as 101 BASIC Computer Games. The very first program, “Amazin” increments its line numbers by 10. But at some point, a user/coder decided they needed an extra space after the code prints out how many dollars the player has; so that extra naked PRINT is on line 195. And the display of the instructions for the game are all kept between lines 100 and 109, another common pattern.
The program listing on page 30 for Basket displays the common habit of starting separate routines at even hundreds and thousands. Line numbers within those routines continue to increment by 10. The pattern is fairly obvious even though new features (and possibly other patterns) have added several lines outside the pattern.
As BASIC implementations began to get RENUM commands, more BASIC code listings appeared with increments of one. This is partly because using an increment of one used less memory. While the line number itself used a fixed amount of RAM (with the result that the maximum line number was often somewhere around FFFF, or 65535), references to line numbers did not tend to use a fixed length. Thus, smaller line numbers used less RAM overall.
Depending on how large the program was, and how much branching it used, this could be significant compared to the amount of RAM the machine itself had.
For example, I recently typed in the SKETCH.BAS program from the October 1984 Rainbow Magazine, page 97. This is a magazine, and a program, for the TRS-80 Color Computer. This program uses increments of 1 for its line numbering. On CLOADing the program in, free memory stands at 17,049. After using RENUM 10,1,10 to renumber it in increments of 10, free memory stands at 16,953.
A savings of 96 bytes may not sound like much, but this is a very small program; and it’s still half a percent of available RAM. The difference could be the difference between a program fitting into available RAM or not fitting. This computer only has 22823 bytes of RAM free even with no program in memory at all.
My client-side code generates UUIDs and sends them to the server.
For example, '6ea140caa83b485f9c98ebaacfb536ce' would be a valid uuid4 to send back.
Is there any way to detect or prevent a user sending back a valid but "user generated" uuid4 like 'babebabebabe4abebabebabebabebabe'?
For example, one way to prevent a certain class of these would be looking at the number of occurrences of 0's and 1's in the binary representation of the number. This could work for a string like '00000000000040000000000000000000' but not for all strings.
It depends a little ...
There is no way to be entirely sure, but depending on the UUID version/subtype you are using, there MIGHT be a way to detect at least some irregular values:
https://www.rfc-editor.org/rfc/rfc4122#section-4.1 defines the original version 1 of UUIDs, and the layout of the UUID fields.
You could, for example, check whether the version and variant fields are valid.
If your UUID generation actually uses version 1, you could, in addition to the first test of version and variant, test whether the timestamp is in a valid range. For example, it might be unlikely that the UUID in question was generated in the year 1600, or in the future.
Tests like these could be applied to check whether the value actually makes sense or is complete gibberish. They cannot protect you against someone thinking: OK, let's analyze this and provide a manually chosen value that satisfies all conditions.
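For illustration, a minimal sketch of such checks in Java using java.util.UUID (the class name UuidSanity and the timestamp window are my own assumptions; the RFC only fixes the version/variant layout):

import java.util.UUID;

public final class UuidSanity {
    // True if the value at least claims the IETF variant from RFC 4122.
    static boolean looksLikeRfc4122(UUID u) {
        return u.variant() == 2;   // top variant bits are 10xx
    }

    // For version 1 UUIDs, check that the embedded timestamp is plausible.
    // timestamp() counts 100-ns intervals since 1582-10-15 and throws for
    // non-time-based UUIDs, hence the version guard.
    static boolean hasPlausibleV1Timestamp(UUID u, long nowMillis) {
        if (u.version() != 1) return false;
        long gregorianOffsetMillis = 12219292800000L; // 1582-10-15 to 1970-01-01
        long uuidMillis = u.timestamp() / 10_000 - gregorianOffsetMillis;
        long year2000Millis = 946684800000L;          // illustrative lower bound
        return uuidMillis >= year2000Millis && uuidMillis <= nowMillis;
    }

    public static void main(String[] args) {
        // fromString needs the dashed form, so re-insert dashes first.
        UUID u = UUID.fromString("6ea140ca-a83b-485f-9c98-ebaacfb536ce");
        System.out.println(looksLikeRfc4122(u) && u.version() == 4); // true
    }
}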
No, there is no way to distinguish user-generated UUIDs from randomly generated UUIDs.
To start with, a user-generated UUID may well be partially random. But let's assume that it is not.
In that case you want to detect a pattern. However, although you give an example of a pattern, a pattern can be almost anything. For instance, the following byte array looks completely random, right?
40 09 21 fb 54 44 2d 18
But actually it is a nothing-up-my-sleeve number commonly used within the cryptographic community: it's simply the encoding of Pi (in this case as a 64-bit floating point, as I was somewhat lazy).
There are certainly randomness tests, for instance the FIPS random number tests. Those require a very large amount of input to decide whether something fails or succeeds. Even then, they only show that certain statistical properties have indeed been attained by a random number generator. The encoding of Pi might very well pass.
And, annoyingly, it is perfectly possible for a random number generator to generate bit strings that do not look random at all, just by chance. The smaller the bit string, the higher the chance of the random number generator generating something that doesn't look random. And UUIDs are not all that big.
So yes, of course you can do some tests, but you can never be sure: you will have both false positives and false negatives.
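To make those limits concrete, here is a naive bit-balance check in Java; the class name and thresholds are arbitrary assumptions, and per the argument above it will sometimes flag genuinely random UUIDs and pass hand-crafted ones:

import java.util.UUID;

public final class BitBalance {
    // Counts set bits across the 128-bit UUID value.
    static int popCount(UUID u) {
        return Long.bitCount(u.getMostSignificantBits())
             + Long.bitCount(u.getLeastSignificantBits());
    }

    // Arbitrary window: ~3.5 standard deviations around the expected 64 ones.
    static boolean looksSuspicious(UUID u) {
        int ones = popCount(u);
        return ones < 44 || ones > 84;
    }

    public static void main(String[] args) {
        System.out.println(looksSuspicious(
            UUID.fromString("00000000-0000-4000-0000-000000000000"))); // true
        System.out.println(looksSuspicious(UUID.randomUUID())); // almost always false
    }
}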
I have a dataset with several variables. I want to create a subsample which only includes the observations which have data for all variables, so no missing data in any of the variable.
I know about the dropmiss command in Stata, but that does not apply here because I do not want to drop variables; I want to drop observations.
I found a question similar to mine on Stack Overflow, but the statistical program used there is SAS and I am using Stata.
(SAS - Keeping only observations with all variables).
An example (the "." is a missing data):
ID year pension age gender
1 2006 300 54 F
2 2007 250 40 M
3 2006 . 45 M
4 2005 . . F
So in this case I only want to keep ID 1 and 2, and drop 3 and 4 from the sample since it contains missing data for some of the variables.
The statement about dropmiss (download from the Stata Journal website after typing search dropmiss) is incorrect.
dropmiss has an obs option geared to this need.
. sysuse auto, clear
(1978 Automobile Data)
. dropmiss, obs
(0 observations deleted)
. dropmiss, obs any
(5 observations deleted)
However, dropmiss is considered by its author (that's me) to be superseded by missings (downloadable similarly from the Stata Journal website). missings doesn't support this directly, as thinking about whether missing values can be handled by multiple imputation is widely considered better statistical practice.
But if you insist, missings can help with this too:
. sysuse auto, clear
(1978 Automobile Data)
. missings tag, gen(anymiss)
Checking missings in all variables:
5 observations with missing values
. drop if anymiss
(5 observations deleted)
There's an egen function rowmiss() that behaves similarly.
What's key here is that you don't need to spell out the variable names concerned. However, watch out: these commands can be highly destructive.
The answer is pretty simple assuming you have a limited number of variables.
Just type:
keep if !missing(var1) & !missing(var2) & !missing(var3)
That command will keep only rows that have non-missing values for all three of the variables mentioned above. Feel free to add more.
I have an input file in this format: (length 20, 10 chars and 10 numerics)
jname1 0000500006
bname1 0000100002
wname1 0000400007
yname1 0000000006
jname1 0000100001
mname1 0000500012
mname2 0000700013
In my jcl I have defined my sysin data as such:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,FD)
DATAEND
*
It works fine as long as I don't add the SUM fields, so I'm wondering if I'm using the wrong format for my numerics. Since I know they start at position 11 and have a length of 10, the format is the only thing that could be wrong.
As you might already have realised, the point of this JCL is just to list the values, but grouped by the first letter of the name (so for the example data and JCL I have given, it would group the numerics for mname1 and mname2 together but leave the other records untouched).
I'm kind of new at this, so I was wondering what I need for the format if my numerics are like that in the input file.
If new to DFSORT, get hold of the DFSORT Getting Started guide for your version of DFSORT (http://www-01.ibm.com/support/docview.wss?uid=isg3T7000080).
This takes you through all the basic operations, with many examples.
The DFSORT Application Programming Guide describes everything you need to know, in detail, again with examples. Appendix C of that document contains all the data-types available (note that FD, which you tried to use, is not a valid data-type, so that was probably a typo). There are tables throughout the document listing which data-types are available where, if there is a particular limit.
For advanced techniques, consult the DFSORT Smart Tricks publication here: http://www-01.ibm.com/support/docview.wss?uid=isg3T7000094
You need to understand a bit more about the way data is stored on a mainframe as well.
Decimals (which can be "packed-decimal" or "zoned-decimal") do not contain a decimal-point. The decimal-point is implied. In high-level languages you tell the compiler where the decimal-point is (in a fixed position) and the compiler does the alignments for you. In Assembler, you do everything yourself.
Decimals are 100% accurate, as there are machine-instructions which act directly on packed-decimal data giving packed-decimal results.
A field which actually contains a decimal-point, cannot be directly used in arithmetic.
An unsigned field is treated as positive when used in any arithmetic.
The SUM statement supports a limited number of numeric definitions, and you have chosen the correct one. It does not matter that your data is unsigned.
If the format of the output from SUM is not what you want, look at OPTION ZDPRINT (or NOZDPRINT).
If you want further formatting, you can use OUTREC or OUTFIL.
As an option to using SUM, you can use OUTFIL reporting functions (especially, although not limited to, if you want a report). You can use SECTIONS and TRAILER3 with TOT/TOTAL.
Something to watch for with SUM (which is not a problem with the reporting features) is whether any one (or more) of your SUMmed fields can exceed the field size. To continue to use SUM if that happens, you need to extend the field in INREC and then get SUM to use the new, sufficient size.
After some trial and error I finally found it: apparently the format I needed to use was the ZD format (zoned decimal, signed), so my SYSIN becomes this:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,ZD)
DATAEND
*
Even though my records don't contain any decimals and they are unsigned, I don't really get it, so if someone knows why it's like that, please go ahead and explain it to me.
For now, the way I'm going to remember it is this: Z = symbol for real (meaning integers, so no decimals).
I'm dealing with a bit of an issue when I try to compare two hashtables in J2ME.
This is the situation:
First, I have two hashtables:
parkingSlot(String SlotId, String Vehicle)
vehicles(vehicleID,"Available");
Is it possible to find this?
parkingSlot
01 "Available"
02 XSD123
03 ASD423
04 "Available"
05 "Available"
vehicles
XSD123 "Available"
LAE212 "Available"
EDO987 "Available"
ADE934 "Available"
ASD423 "Available"
I need to get the car plates that exist in both hashtables. I tried using two iterations with Enumeration, adding the values from the first hashtable and the keys from the second one to another hashtable, or comparing each element against the others, but I couldn't make it work.
Can someone give me a hand with this? (I can attach my test code.)
Finally I got the solution for this case: I added the contents of the first Hashtable to a Vector, and then I compared the Vector against the second Hashtable, removing the duplicated data.
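A minimal sketch of that approach in J2ME-style Java (the class and method names are made up; only CLDC-era APIs such as Hashtable, Vector, and Enumeration are used, so no generics):

import java.util.Enumeration;
import java.util.Hashtable;
import java.util.Vector;

public final class PlateIntersection {
    // Returns the vehicle IDs that appear both as values in parkingSlot
    // and as keys in vehicles.
    static Vector platesInBoth(Hashtable parkingSlot, Hashtable vehicles) {
        // Step 1: copy parkingSlot's values into a Vector.
        Vector slotValues = new Vector();
        for (Enumeration e = parkingSlot.elements(); e.hasMoreElements();) {
            slotValues.addElement(e.nextElement());
        }
        // Step 2: keep each vehicles key that also occurs in the Vector.
        Vector result = new Vector();
        for (Enumeration k = vehicles.keys(); k.hasMoreElements();) {
            Object plate = k.nextElement();
            if (slotValues.contains(plate)) {
                result.addElement(plate);
            }
        }
        return result;
    }
}

For the sample data above, platesInBoth returns XSD123 and ASD423 (in some order, since Hashtable enumeration order is unspecified).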