I need to split a sequential Mainframe file. Well, to be precise I need to copy the content from this file to another, starting from a specific keyword. Example:
line1
line2
line3
start line4
line5
line6
In this case I need to search for "start" and copy everything starting from line4 to another file using either REXX or SORT. Any suggestions?
EDIT: Here is what I came up with in REXX, but I am not happy with it:
"EXECIO * DISKR INPUT (STEM INPUT. FINIS)"
LINEINPUT = 1
LINEOUTPUT = 1
FOUND = 0 /*working like a boolean?
DO WHILE LINEINPUT <= INPUT.0
IF INPUT.LINEINPUT = start line4 THEN DO
FOUND = 1
END
IF FOUND = 1 THEN DO
INPUT.LINEINPUT = OUTPUT.LINEOUTPUT
LINEOUTPUT = LINEOUTPUT + 1
END
LINEINPUT = LINEINPUT + 1
END
Something like this maybe, but this means that I need to go through all those files line by line. Maybe there is a better way in JCL? Maybe Syncsort can do something like this?
RECFM is fixed, FBA to be precise, LRECL 170. The trigger can either be a part of this line or the whole line. This is not important, because it is always the same line in every single file, and even the first few characters are unique within the whole file. That means "start" appears only once in the file.
For a fixed-position start of the trigger:
OPTION COPY
INREC IFTHEN=(WHEN=GROUP,
BEGIN=(1,5,CH,EQ,C'start'),
PUSH=(171:ID=1))
OUTFIL OMIT=(171,1,CH,EQ,C' '),
BUILD=(1,170)
For a variable-position, unique trigger:
OPTION COPY
INREC IFTHEN=(WHEN=GROUP,
BEGIN=(1,170,SS,EQ,C'start'),
PUSH=(171:ID=1))
OUTFIL OMIT=(171,1,CH,EQ,C' '),
BUILD=(1,170)
WHEN=GROUP gives you PUSH, which will put data from the current record, or a group-number (ID) or number within group (SEQ), to that position on all the records in the group (including the current record). In this case, the "group" is the rest of the file.
SS is a field type which allows a substring search.
Then an OMIT= on OUTFIL (similar to OMIT COND=, but applied after the records have been processed by the earlier control cards) does the actual selection using the flag (records automatically extended without a value from the PUSH have that byte set to blank), and BUILD drops the extra byte again.
The extra byte is needed because SORT has no "program storage" for definitions - extra fields have to be on the records (or, for the duration of the current record only, in PARSEd fields).
Sorry for not commenting, I don't have enough reputation, so I am writing this as an answer. I will remove it later.
You can achieve this with either method, i.e. through REXX or JCL.
Please show us what you have tried so far so we can figure out what to improve.
Related
I have a list of phrases; actually it's an Excel file, but I can extract each single line if needed.
I need to find the lines that are quite similar. For example, one line can be:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°)
and some lines later I can have the same line, or this one:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA200 (temp.max60°)
As you can see, these two lines are pretty much the same: not equal in this case, but about 98% similar.
The main problem is that I have to process about 45k lines; for this reason I'm searching for a quick and maybe visual way to do it.
The first thing that came to my mind was to compare the 1st line to the 2nd, then the 3rd, and so on until the end; then the same for the 2nd line, the 3rd, and so on until the last-but-one, and produce a kind of score, for example: the 1st line is 100% similar to line 42, 99% to line 522, ... 21% to line 22142, etc.
But it's only one idea, maybe not the best.
Maybe there's already a good program/script/online service out there; I searched but couldn't find one, so in the end I'm asking here.
Does anyone know a good way (if this is possible), a script, or an online service to achieve this?
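For illustration, the pairwise-scoring idea above could be prototyped in a few lines of Python with difflib; the file name and the 98% threshold are just placeholders taken from the description:
import difflib

# One phrase per line, exported from the Excel file (the file name is an example)
with open("phrases.txt", encoding="utf-8") as f:
    lines = [l.rstrip("\n") for l in f]

# Compare every line with every later line and report near-duplicates.
# This is O(n^2), so 45k lines means roughly a billion comparisons.
for i, a in enumerate(lines):
    for j in range(i + 1, len(lines)):
        ratio = difflib.SequenceMatcher(None, a, lines[j]).ratio()
        if ratio >= 0.98:
            print(f"line {i + 1} ~ line {j + 1} ({ratio:.0%}): {a}")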
One thing you can do is write a script which does the following:
Extract the data from the CSV file
Define a regex which can capture the similarity; a Python example can be:
[\w\s]+\([\w]+\)[\w\s]+\([\w°]+\)
Or something similar; refer to the documentation.
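As a rough sketch of that approach (the pattern is the one above; it will likely need widening, e.g. to allow '.' and spaces inside the parentheses, before it matches lines like the sample):
import re

# Pattern from above; adjust the character classes to fit the real data
pattern = re.compile(r'[\w\s]+\([\w]+\)[\w\s]+\([\w°]+\)')

with open("phrases.txt", encoding="utf-8") as f:   # the file name is an example
    for number, line in enumerate(f, start=1):
        if pattern.fullmatch(line.strip()):
            print(number, line.strip())            # lines that share the expected shape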
The problem you have is that you are not looking for an exact match, but a like.
This is a problem even databases have never solved and results in a full table scan.
So we're unlikely to solve it.
However, I'd like to propose that you consider alternatives:
You could decide to limit the differences to specific character sets.
In the above example, you were ignoring numbers but respecting letters.
If we can assume that this rule will always hold true, then we can perform a text replace on the string.
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°) ==> ANTIBRATING SSPIRING JOINT (type _) mod. GA_ (temp.max_°)
Now, we can deal with this problem by performing an exact string comparison. This can be done by hashing. The easiest way is to feed a hashmap/hashset or a database with a hash index on the column where you will store this adjusted text.
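A minimal sketch of that masking-and-grouping idea (the digits-to-underscore rule is the assumption from the example above, and the file name is a placeholder):
import re
from collections import defaultdict

def mask(text):
    # Collapse every run of digits to '_' so GA160 and GA200 become GA_
    return re.sub(r'\d+', '_', text)

groups = defaultdict(list)
with open("phrases.txt", encoding="utf-8") as f:
    for line in f:
        groups[mask(line.strip())].append(line.strip())

# Masked strings shared by more than one original line are the near-duplicates
for masked, originals in groups.items():
    if len(originals) > 1:
        print(masked, "->", originals)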
You could decide to trade time for space.
For example, you can feed the strings to a service which will build lots of different variations of indexes on your string. For example, feed elasticsearch with your data, and then perform analytic queries on it.
Fuzzy search is the key.
I found several projects and ideas, but the one I used is tre-agrep. I know it is quite old, but in this case it works for me. I created this little script to help me create a list of differences, so I can manually check it against my file:
#!/bin/bash

########## CONFIGURATIONS ##########
original_file=/path/jjj.txt
t_agrep_bin="$(command -v tre-agrep)"
destination_file=/path/destination_file.txt
distance=1
########## CONFIGURATIONS ##########

lines=$(grep -c "" "$original_file")

if [[ -s "$destination_file" ]]; then
    rm -f "$destination_file"
fi

start=1
while IFS= read -r line; do
    echo "Checking line $start/$lines"
    # Collect the line numbers of every line within $distance edits of the current one
    lista=$("$t_agrep_bin" -"$distance" -B --colour -s -n -i "$line" "$original_file")
    echo "$lista" | awk -F ':' '{print $1}' ORS=' ' >> "$destination_file"
    echo >> "$destination_file"
    start=$((start + 1))
done < "$original_file"
The following command does not append but replaces the content
echo 0 >> /sys/block/nvme0n1/queue/nomerges
I don't want to replace the content, I want to append to it. But I'm curious: is there something special about this file?
It also doesn't allow more than one character as its input.
Look at https://serverfault.com/questions/865787/what-does-the-nomerge-mean-in-linux-system
It might help you understand that there are only 3 values the file can take.
Also:
nomerges enables the user to disable the lookup logic involved with IO
merging requests in the block layer. By default (0) all merges are
enabled. When set to 1 only simple one-hit merges will be tried. When
set to 2 no merge algorithms will be tried (including one-hit or more
complex tree/hash lookups).
Assume you have an unsorted file with the following content:
identifier,count=Number
identifier, extra information
identifier, extra information
...
I want to sort this file so that, for each id, the line with the count comes first, followed by the lines with extra info. I can only use the Unix sort command with the option -k1,1, but I am allowed to slightly change the lines to obtain this sort.
As an example, take
a,Count=1
a,giulio
aa,Count=44
aa,tango
aa,information
ee,Count=2
bb,que
f,Count=3
b,Count=23
bax,game
f,ee
c,Count=3
c,roma
b,italy
bax,Count=332
a,atlanta
bb,Count=78
c,Count=3
The output should be
a,Count=1
a,atlanta
a,giulio
aa,Count=44
aa,information
aa,tango
b,Count=23
b,italy
bax,Count=332
bax,game
bb,Count=78
bb,que
c,Count=3
c,roma
ee,Count=2
f,Count=3
f,ee
but I get:
aa,Count=44
aa,information
aa,tango
a,atlanta
a,Count=1
a,giulio
bax,Count=332
bax,game
bb,Count=78
bb,que
b,Count=23
b,italy
c,Count=3
c,Count=3
c,roma
ee,Count=2
f,Count=3
f,ee
I tried adding spaces at the end of the identifier and/or at the beginning of the count field and other characters, but none of these approaches work.
Any pointer on how to perform this sorting?
EDIT:
If you consider, for example, the products with an id starting with a, one of them has the info 'atlanta' and it appears before the Count line (but I want Count to appear before any other information). In addition, bb should come after b in the alphabetical order of the ids. To make my question clearer: how can I get the ids sorted in alphabetical order and, for a given id, the line with Count before the others? And how can I do this using sort -k1,1 (this is a group project I am working on and I am not free to change the sorting command), while possibly changing the content of the lines (I tried, for example, adding a '~' to all the info lines so that Count sorts first)?
You need to tell sort that the comma is used as the field separator:
sort -t, -k1,1
For ASCII sorting, make sure LC_ALL=C is set and LANG and LANGUAGE are unset. In the C locale, uppercase letters sort before lowercase ones, so for lines with the same id the Count=... line comes out before the lowercase info lines (when the -k1,1 keys compare equal, sort falls back to comparing the whole line).
I am using Groovy and I am trying to alter a text file at a specific line, without looping through all of the previous lines. Is there a way to specify the line of a text file that you wish to alter?
For instance
Text file is:
1
2
3
4
5
6
I would like to say
Line(3) = p
and have it change the text file to:
1
2
p
4
5
6
I DO NOT want to have to loop and iterate through the lines to change the value, i.e. I do not want to use an .eachLine { line -> ... } method.
Thank you in advance, I really appreciate it!
I don't think you can skip lines and traverse like this. You could do the skip by using RandomAccessFile in Java, but instead of lines you would have to specify the number of bytes.
Try using readLines() on the file. It will store all your lines in a list. To change the content at line n, change the content at index n-1 in the list and then join the list items.
Something like this will do:
// We can call this the DefaultFileHandler
file = new File('data.txt')   // the file to modify ('data.txt' is just an example name)
lineNumberToModify = 3
textToInsert = "p"

line(lineNumberToModify, textToInsert)

def line(num, text) {
    def list = file.readLines()    // read all lines into a list
    list[num - 1] = text           // replace the requested (1-based) line
    file.setText(list.join("\n"))  // write the list back to the file
}
EDIT: For extremely large files, it is better to have a custom implementation, maybe something along the lines of what Tim Yates suggested in the comment on your question.
The above readLines() can easily process up to 100,000 lines of text in less than a second. So you can do something like this:
if(file size < 10 MB)
use DefaultFileHandler()
else
use CustomFileHandler()
//CustomFileHandler
- Split the large file into buckets of acceptable size.
- Ex: Bucket 1(1-100000 lines), Bucket 2(100000-200000 lines), etc.
- if (lineNumberToModify falls in bucket range)
insert into line in the bucket
There is no hard and fast rule for how you implement your CustomFileHandler, as it completely depends on the use-case scenario. If you need to do the above operation multiple times on the same file, you can do the complete bucket split first, keep the buckets in memory and use them for the following operations. If it is a one-time operation, you can avoid building all the buckets up front, deal only with the one you need, and process the others later on demand.
Even within the buckets you can add your own intelligence to speed up the job. Say you want to change line 99999 of a bucket holding lines 1-100000; you can exploit Groovy's methods and closures to the fullest:
file.readLines().reverse()[1] = "some text"
I am running cygwin on Windows 7. I am using a signal processing tool and basically performing alignments. I had about 1200 input files. Each file is of the format given below.
input_file_ format = "AC_XXXXXX.abc"
The first step required building some kind of index for each of the input files; this was done with the tool's build-index command, and now each file has 6 indexes associated with it. Therefore I now have about 1200*6 = 7200 index files. The indexes are of the form given below.
indexes_format = "AC_XXXXXX.abc.1",
"AC_XXXXXX.abc.2",
"AC_XXXXXX.abc.3",
"AC_XXXXXX.abc.4",
"AC_XXXXXX.abc.rev.1",
"AC_XXXXXX.abc.rev.1"
Now I need to use these indexes to perform the alignment. All 6 indexes of each file are used together, and the final operation is done as follows.
signal-processing-tool ..\path-to-indexes\AC_XXXXXX.abc ..\Query file
Where AC_XXXXXX.abc is the index base name associated with that particular input file. All 6 index files are picked up via the AC_XXXXXX.abc prefix.
My problem is that I need to use only the first 14 characters of the index file names for the final operation.
When I use the code below, the alignment is not executed.
for file in indexes/*; do ./tool $file|cut -b1-14 Project/query_file; done
I'd appreciate help with this!
First of all, keep in mind that $file will always start with "indexes/", so taking the first 14 characters would always include that folder name at the beginning.
To use the first 14 characters of a variable, use ${file:0:14}, where 0 is the starting index and 14 is the length of the desired substring.
Alternatively, if you want to use cut, you need to run it in a command substitution:
for file in indexes/*; do ./tool "$(echo "$file" | cut -c 1-14)" Project/query_file; done
I changed the argument for cut to -c so it counts characters instead of bytes.