Remove multiple headers from a PS file - mainframe

I have a PS file from which I want to remove a header if there is no data below it, i.e. if there are headers (recognized by the first 3 letters, HDR) on two consecutive lines, I want to remove the first one, as there is no data for it.
Input data
HDR20170123
HDR20170124
1.8988 ABCD
1.4324 PARE
HDR20170125
1.5432 URST
Desired Output
HDR20170124
1.8988 ABCD
1.4324 PARE
HDR20170125
1.5432 URST
Is there any way we can do this using DFSORT?

There are two techniques, and the JOINKEYS technique is easier to explain in a short space of time.
You use JOINKEYS, with your data set name for both input files.
You define JNFnCNTL data sets for both inputs, and in each of those you append a sequence number to each record. One sequence number (in JNF1CNTL) starts from zero, the other (in JNF2CNTL) starts from one. The sequence numbers need to be large enough for the number of records you have to be expressed.
The JOINKEYS key is the sequence number appended to each file.
Use JOIN UNPAIRED,F2 (which will give you the matched records plus the unmatched records from F2).
REFORMAT of F1:1,3 and F2:1,80.
OMIT with COND= for the main task, to get rid of the joined records where 1,3 is HDR and 1,3 matches 4,3, that is, where the record (positions 4 onwards) is a header and the record which follows it (whose first three bytes are in positions 1-3) is also a header, so it has no data below it.
Then BUILD=(4,80) in the main task, to get rid of the first three bytes (the look-ahead from F1), leaving just the original 80-byte record.
On the inputs your data will look like this, offset to represent the sequence numbers:
F1          F2
HDR20170123
HDR20170124 HDR20170123
1.8988 ABCD HDR20170124
1.4324 PARE 1.8988 ABCD
HDR20170125 1.4324 PARE
1.5432 URST HDR20170125
            1.5432 URST
And on the REFORMAT:
HDRHDR20170123
1.8HDR20170124
1.41.8988 ABCD
HDR1.4324 PARE
1.5HDR20170125
   1.5432 URST
What you've achieved is the availability of data from an adjacent record (here just its first three bytes, as much as you need for a given case) alongside the record itself, so testing a record against its neighbour is easy.
Time for some code now:
//SYSIN DD *
OPTION COPY
JOINKEYS F1=INA,FIELDS=(81,6,A),SORTED,NOSEQCK
JOINKEYS F2=INB,FIELDS=(81,6,A),SORTED,NOSEQCK
JOIN UNPAIRED,F2
REFORMAT FIELDS=(F1:1,3,
F2:1,80)
OMIT COND=(1,3,CH,EQ,C'HDR',
AND,
1,3,CH,EQ,4,3,CH)
INREC BUILD=(4,80)
//JNF1CNTL DD *
INREC OVERLAY=(81:SEQNUM,6,ZD,
START=0)
//JNF2CNTL DD *
INREC OVERLAY=(81:SEQNUM,6,ZD,
START=1)
//INA DD *
HDR20170123
HDR20170124
1.8988 ABCD
1.4324 PARE
HDR20170125
1.5432 URST
//INB DD *
HDR20170123
HDR20170124
1.8988 ABCD
1.4324 PARE
HDR20170125
1.5432 URST
That produces your desired output:
HDR20170124
1.8988 ABCD
1.4324 PARE
HDR20170125
1.5432 URST
A JOINKEYS operation consists of three "tasks" operating concurrently. The "main task" is an entirely normal SORT step, consisting of whatever you want.
There are two sub-tasks, one for each of the input data sets. Each of those sub-tasks can have further control cards supplied to modify its data. These are specified on the JNFnCNTL DDs: JNF1CNTL and JNF2CNTL. You may supply none, either one, or both, as the actual requirement dictates. Here you want both.
JNFnCNTL data sets may only contain a subset of the normal control cards. They may not contain OUTREC or OUTFIL. This is because they inter-operate with the main task at exactly the point where OUTREC could otherwise exist.
On the JOINKEYS statements, specify SORTED,NOSEQCK. By default the JOINKEYS data sets are sorted (on the key for the match); SORTED says the data is already in order, and NOSEQCK says the order does not even need to be checked, because the key is a sequence number which is in order by construction.
The REFORMAT statement should only include data that is required in the main task. Here all that is needed is the location where HDR may exist, and the full record from F2.
JOIN UNPAIRED,F2 will obtain all matched records, plus any F2 records which do not match (there will only be one, the final record, because the match is on the sequence numbers, offset by one).
To understand this (or any DFSORT data manipulation) further, amend the job to show the data at intermediate stages. Here there is only one stage, so it is simple:
//SYSIN DD *
OPTION COPY
JOINKEYS F1=INA,FIELDS=(81,6,A),SORTED,NOSEQCK
JOINKEYS F2=INB,FIELDS=(81,6,A),SORTED,NOSEQCK
JOIN UNPAIRED,F2
REFORMAT FIELDS=(F1:1,12,81,6,12,1,
F2:1,12,81,6,12,1,
?)
//JNF1CNTL DD *
INREC OVERLAY=(81:SEQNUM,6,ZD,
START=0)
//JNF2CNTL DD *
INREC OVERLAY=(81:SEQNUM,6,ZD,
START=1)
//INA DD *
HDR20170123
HDR20170124
1.8988 ABCD
1.4324 PARE
HDR20170125
1.5432 URST
//INB DD *
HDR20170123
HDR20170124
1.8988 ABCD
1.4324 PARE
HDR20170125
1.5432 URST
Produces this output:
HDR20170124 000001 HDR20170123 000001 B
1.8988 ABCD 000002 HDR20170124 000002 B
1.4324 PARE 000003 1.8988 ABCD 000003 B
HDR20170125 000004 1.4324 PARE 000004 B
1.5432 URST 000005 HDR20170125 000005 B
                   1.5432 URST 000006 2
Since you only show 11 bytes of data in an 80-byte record, on this REFORMAT statement only the first 12 positions of each record are taken (the 12th to leave a blank) and 12,1 is used again as a separator (literals cannot be used in a REFORMAT statement). The respective sequence numbers are also shown, as is the built-in match-marker (the ? (question mark) in the REFORMAT): B for on Both files, 2 for on F2 only (no 1 appears, because the JOIN statement only asks for matched records and unmatched F2 records).
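As a cross-check of the rule itself, outside DFSORT entirely, the same "drop a header whose following record is also a header" logic can be sketched in a few lines of Python. This is only an illustration of the rule, not part of the DFSORT solution:

def drop_empty_headers(records):
    # Keep a record unless it starts with HDR and the record that
    # follows it also starts with HDR (a header with no data below it).
    kept = []
    for i, rec in enumerate(records):
        nxt = records[i + 1] if i + 1 < len(records) else ""
        if rec.startswith("HDR") and nxt.startswith("HDR"):
            continue
        kept.append(rec)
    return kept

data = ["HDR20170123", "HDR20170124", "1.8988 ABCD",
        "1.4324 PARE", "HDR20170125", "1.5432 URST"]
print("\n".join(drop_empty_headers(data)))   # prints the desired output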

Related

Is there a way to compare DDMMYYYY format date with current date?

I have a requirement to get today's transactions into a separate file through JCL SORT. The date format that I have is in DDMMYYYY format.
INCLUDE COND=(20,10,CH,EQ,DATE1) will not work, because DATE1 returns the date in C'yyyymmdd' format.
Try reformatting the date in the input file and then comparing it with DATE1. Refer to the SORT card below:
----+----1----+----2----+----3----+----4----+-
//SORTIN DD *
DATA1               01102020
DATA2               07102020
DATA3               07102020
DATA4               01092020
DATA5               01102010
DATA6               01102019
/*
//SORTOUT DD SYSOUT=*
//SYSIN DD *
OPTION COPY
INREC BUILD=(1,28,X,25,4,23,2,21,2)
OUTFIL REMOVECC,
BUILD=(1,28),INCLUDE=(30,08,CH,EQ,DATE1)
/*
The INREC BUILD copies the original record (1,28), adds a blank, and then appends the date rearranged as YYYYMMDD (year from 25,4, month from 23,2, day from 21,2) starting at position 30, where OUTFIL's INCLUDE can compare it with DATE1 (which is C'yyyymmdd'). Assuming the job runs on 7 October 2020, the output will be:
DATA2               07102020
DATA3               07102020
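The same rearrange-then-compare idea, sketched in Python purely to illustrate what the BUILD is doing (the 0-based offset 20 used below corresponds to the DDMMYYYY date sitting in columns 21-28, which is an assumption about the layout):

from datetime import date

def is_today(record, date_offset=20):
    # Slice DDMMYYYY out of the record and rebuild it as YYYYMMDD,
    # then compare it with today's date, as the sort card does with DATE1.
    dd = record[date_offset:date_offset + 2]
    mm = record[date_offset + 2:date_offset + 4]
    yyyy = record[date_offset + 4:date_offset + 8]
    return yyyy + mm + dd == date.today().strftime("%Y%m%d")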

Power Query: How to delete duplicate characters from a string (e.g. xzxxxzzzzxzzzzx -> leave only xz)?

I have a huge table in Power Query with text in cells that consist of multiple 'x's and 'z's. I want to deduplicate values so I have one x and one z only.
For example:
xzzzxxxzxz-> xz
zzzzzzzzzz-> z
The table is very big, so I don't want to create additional columns. Can you please help?
You can convert the string to a list of characters, make the list distinct (remove duplicates), sort it (if desired), and then transform it back to text. For example, xzzzxxxzxz splits into characters, deduplicates to {"x", "z"}, and recombines to "xz".
= Table.TransformColumns(#"Previous Step",
      {{"ColumnName",
        each Text.Combine(List.Sort(List.Distinct(Text.ToList(_)))),
        type text}})

Join two DataFrames where the join key is different and only select some columns

What I would like to do is:
Join two DataFrames A and B using their respective id columns a_id and b_id. I want to select all columns from A and two specific columns from B.
I tried something like what I put below, with different quotation marks, but it is still not working. I feel that in pyspark there should be a simple way to do this.
A_B = A.join(B, A.id == B.id).select(A.*, B.b1, B.b2)
I know you could write
A_B = sqlContext.sql("SELECT A.*, B.b1, B.b2 FROM A JOIN B ON A.a_id = B.b_id")
to do this, but I would like to do it more like the pseudocode above.
Your pseudocode is basically correct. A slightly modified version would work if the id column existed in both DataFrames (the DataFrames are aliased as A and B here so that "A.*", "B.b1" and "B.b2" can be resolved):
A_B = A.alias("A").join(B.alias("B"), on="id").select("A.*", "B.b1", "B.b2")
From the docs for pyspark.sql.DataFrame.join():
If on is a string or a list of strings indicating the name of the join
column(s), the column(s) must exist on both sides, and this performs
an equi-join.
Since the keys are different, you can just use withColumn() (or withColumnRenamed()) to create a column with the same name in both DataFrames:
A_B = A.withColumn("id", col("a_id")).join(B.withColumn("id", col("b_id")), on="id")\
.select("A.*", "B.b1", "B.b2")
If your DataFrames have long complicated names, you could also use alias() to make things easier:
A_B = long_data_frame_name1.alias("A").withColumn("id", col("a_id"))\
.join(long_data_frame_name2.alias("B").withColumn("id", col("b_id")), on="id")\
.select("A.*", "B.b1", "B.b2")
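Putting the alias approach together, here is a minimal self-contained sketch; the sample data and the column names a_val, b1 and b2 are made up purely for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("join-example").getOrCreate()

# Hypothetical data: A carries a_id plus its own columns, B carries b_id plus b1/b2
A = spark.createDataFrame([(1, "x"), (2, "y")], ["a_id", "a_val"])
B = spark.createDataFrame([(1, 10, 100), (2, 20, 200)], ["b_id", "b1", "b2"])

# Alias both sides so "A.*" and "B.b1"/"B.b2" resolve, and join directly on the
# differently named key columns without renaming anything
A_B = (A.alias("A")
        .join(B.alias("B"), F.col("A.a_id") == F.col("B.b_id"))
        .select("A.*", "B.b1", "B.b2"))

A_B.show()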
Try this solution:
A_B = A.join(B,col('B.id') == col('A.id')).select([col('A.'+xx) for xx in A.columns]
+ [col('B.other1'),col('B.other2')])
The lines below inside the select() do the trick of selecting all columns from A and two columns from B.
[col('A.'+xx) for xx in A.columns] : all columns of A
[col('B.other1'), col('B.other2')] : the selected columns of B
I think the easier solution is just to join table A to table B with only the columns you want from B. Here is sample code to do this:
joined_tables = table_A.join(table_B.select('id', 'col1', 'col2', 'col3'), ['id'])
The code above joins all columns from table_A with columns "col1", "col2" and "col3" from table_B; note that the join key 'id' has to be kept in the select for the join on ['id'] to work.

How to optimize a join?

I have a query to join the tables. How do I optimize it so it runs faster?
val q = """
| select a.value as viewedid,b.other as otherids
| from bm.distinct_viewed_2610 a, bm.tets_2610 b
| where FIND_IN_SET(a.value, b.other) != 0 and a.value in (
| select value from bm.distinct_viewed_2610)
|""".stripMargin
val rows = hiveCtx.sql(q).repartition(100)
Table descriptions:
hive> desc distinct_viewed_2610;
OK
value string
hive> desc tets_2610;
OK
id int
other string
the data looks like this:
hive> select * from distinct_viewed_2610 limit 5;
OK
1033346511
1033419148
1033641547
1033663265
1033830989
and
hive> select * from tets_2610 limit 2;
OK
1033759023
103973207,1013425393,1013812066,1014099507,1014295173,1014432476,1014620707,1014710175,1014776981,1014817307,1023740250,1031023907,1031188043,1031445197
The distinct_viewed_2610 table has 1.1 million records, and I am trying to get similar ids for them from table tets_2610, which has 200,000 rows, by splitting the second column.
For 100,000 records it takes 8.5 hours to complete the job with two machines:
one with 16 GB RAM and 16 cores,
the second with 8 GB RAM and 8 cores.
Is there a way to optimize the query?
Right now you are doing a Cartesian join. A Cartesian join gives you 1.1M * 200K = 220 billion rows. After the Cartesian join, they are filtered by where FIND_IN_SET(a.value, b.other) != 0.
Analyze your data.
If the 'other' string contains 10 elements on average, then exploding it will give you about 2M rows in table b. And if, say, only 1/10 of those rows join, you will have 2M/10 = 200K rows after the INNER JOIN.
If these assumptions are correct, then exploding the array and joining will perform better than a Cartesian join plus filter.
select distinct a.value as viewedid, b.otherids
  from bm.distinct_viewed_2610 a
       inner join (select e.otherid, b.other as otherids
                     from bm.tets_2610 b
                          lateral view explode(split(b.other, ',')) e as otherid
                  ) b on a.value = b.otherid
And you do not need this:
and a.value in (select value from bm.distinct_viewed_2610)
Sorry, I cannot test the query; please test it yourself.
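If you would rather stay in the DataFrame API than build a SQL string, the same explode-and-join idea can be sketched in PySpark; this assumes a session called spark that can see the same Hive tables, and it is untested against your data:

from pyspark.sql import functions as F

viewed = spark.table("bm.distinct_viewed_2610")
tets = spark.table("bm.tets_2610")

# One row per id taken from the comma-separated 'other' column,
# keeping the original string alongside it
exploded = tets.select(
    F.explode(F.split(F.col("other"), ",")).alias("otherid"),
    F.col("other").alias("otherids"))

# Equi-join on the exploded id instead of a Cartesian product plus FIND_IN_SET
result = (viewed.join(exploded, viewed["value"] == exploded["otherid"])
                .select(F.col("value").alias("viewedid"), "otherids")
                .distinct())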
If you are using ORC format, consider changing to Parquet, and based on your data I would say choose range partitioning.
Choose the proper parallelism to execute fast.
I have answered on the following link; it may help you:
Spark doing exchange of partitions already correctly distributed
Please also read:
http://dev.sortable.com/spark-repartition/

Broadcasting in spark streaming

How can I broadcast a DStream computed over a window? For instance, for the last 10 minutes I find the subset of lines satisfying a condition (call it the send_events DStream). I then need to find the set of lines satisfying another condition (call it the ack_events_for_send_events DStream) in the last 10 minutes, using the send_events DStream. I do not want to use groupByKey due to the large shuffle. When I do groupByKey, the size of each group is very small, at most about 10; in other words, I have lots of groups. (I am not sure if this helps to optimize my operations; just wanted to share.)
Example:
id1, type1, time1
id1, type2, time3
id2, type1, time5
id1, type1, time2
id2, type2, time4
id1, type2, time6
I want to find the minimum time difference between type1 and type2 per id. Each id has at most 10 lines, but I have 10,000 ids in a given window.
Maybe the following would work?
yourDStream.foreachRDD(somefunc)
Then in somefunc:
def somefunc(rdd):
    # Collect the windowed RDD to the driver and broadcast it to the executors
    broadcastedList = sc.broadcast(rdd.collect())
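For context, a rough end-to-end sketch of this pattern is shown below; the socket source, the window length and the filter condition are stand-ins for your real send_events logic, and the sketch is untested:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Illustrative only: a local streaming context with 10-second batches
sc = SparkContext("local[2]", "broadcast-window-sketch")
ssc = StreamingContext(sc, 10)

# Hypothetical source and condition standing in for the real send_events logic
lines = ssc.socketTextStream("localhost", 9999).window(600)   # roughly the last 10 minutes
send_events = lines.filter(lambda line: "type1" in line)

def somefunc(rdd):
    # Collect the small windowed subset to the driver and broadcast it, so other
    # work on the same batch can read broadcasted.value without a shuffle
    broadcasted = sc.broadcast(rdd.collect())
    # ... use broadcasted.value inside subsequent RDD operations here ...

send_events.foreachRDD(somefunc)

ssc.start()
ssc.awaitTermination()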
