How to join records in Easytrieve internal SORT?

I have a requirement where I need to extract two types of records from a single input file and join them for EZT report processing.
Currently, I've written an ICETOOL step to perform the extraction followed by the join. The output of the ICETOOL step is fed to the Easytrieve report step.
Here is the extraction card -
SORT FIELDS=(14,07,PD,A)
OUTFILE FNAMES=FILE010,INCLUDE=(25,03,CH,EQ,C'010')
OUTFILE FNAMES=FILE011,INCLUDE=(25,04,CH,EQ,C'011')
OPTION DYNALLOC=(SYSDA,05)
Here is the join card -
SORT FIELDS=(14,07,PD,A)
JOINKEYS F1=FILE010,FIELDS=(14,07,A),SORTED,NOSEQCHK
JOINKEYS F2=FILE011,FIELDS=(14,07,A),SORTED,NOSEQCHK
REFORMAT FIELDS=(F1:14,07,
F2:25,10)
OUTREC BUILD=(1,17,80:X),VTOF
OPTION DYNALLOC=(SYSDA,05)
I'm wondering if it is possible to perform the above SORT/ICETOOL operations within Easytrieve. I've used the Easytrieve internal SORT, but only for simple extractions. Can the join operation be performed within Easytrieve?
Note - The idea is to have a single EZT step.

You can make use of the Synchronized File Processing (SFP) facility in Easytrieve to achieve this. Read more about it in the Easytrieve Plus documentation.
FILE FILE010
KEY1 14 7 N
*
FILE FILE011
KEY2 14 7 N
FIELD1 25 10 A
*
FILE OUTFILE FB(80 0)
OKEY 1 7 N
OFIELD 8 10 A
*
WS-COUNT W 5 N VALUE 0
*
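* SYNCHRONIZED FILE PROCESSING: BOTH INPUT FILES MUST BE IN KEY SEQUENCE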
JOB INPUT (FILE010 KEY(KEY1) FILE011 KEY(KEY2)) FINISH(DIS)
*
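* STOP ONCE FILE010 IS EXHAUSTED (ANY REMAINING FILE011 RECORDS ARE IGNORED)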
IF EOF FILE010
STOP
END-IF
*
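* MATCHED IS TRUE WHEN THE CURRENT RECORDS OF BOTH FILES HAVE EQUAL KEYS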
IF MATCHED
OKEY = KEY1
OFIELD = FIELD1
WS-COUNT = WS-COUNT + 1
PUT OUTFILE
END-IF
*
DIS. PROC
DISPLAY 'RECORDS WRITTEN: ' WS-COUNT
END-PROC
Please note:
The above code isn't tested; it's just a draft showing the idea of file matching in Easytrieve to achieve the task.
Data types for the data items are assumed. You may have to change them suitably.
You may have to define the respective input datasets on the FILE statements.
You may add more statements within the IF MATCHED condition for the creation of the report.
Hope this helps!


Why does my PySpark regular expression not give more than the first row?

Taking inspiration from this answer: https://stackoverflow.com/a/61444594/4367851 I have been able to split my .txt file into columns in a Spark DataFrame. However, it only gives me the first game - even though the sample .txt file contains many more.
My code:
basefile = spark.sparkContext.wholeTextFiles("example copy 2.txt").toDF().\
selectExpr("""split(replace(regexp_replace(_2, '\\\\n', ','), ""),",") as new""").\
withColumn("Event", col("new")[0]).\
withColumn("White", col("new")[2]).\
withColumn("Black", col("new")[3]).\
withColumn("Result", col("new")[4]).\
withColumn("UTCDate", col("new")[5]).\
withColumn("UTCTime", col("new")[6]).\
withColumn("WhiteElo", col("new")[7]).\
withColumn("BlackElo", col("new")[8]).\
withColumn("WhiteRatingDiff", col("new")[9]).\
withColumn("BlackRatingDiff", col("new")[10]).\
withColumn("ECO", col("new")[11]).\
withColumn("Opening", col("new")[12]).\
withColumn("TimeControl", col("new")[13]).\
withColumn("Termination", col("new")[14]).\
drop("new")
basefile.show()
Output:
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
| Event| White| Black| Result| UTCDate| UTCTime| WhiteElo| BlackElo| WhiteRatingDiff| BlackRatingDiff| ECO| Opening| TimeControl| Termination|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
|[Event "Rated Cla...|[White "BFG9k"]|[Black "mamalak"]|[Result "1-0"]|[UTCDate "2012.12...|[UTCTime "23:01:03"]|[WhiteElo "1639"]|[BlackElo "1403"]|[WhiteRatingDiff ...|[BlackRatingDiff ...|[ECO "C00"]|[Opening "French ...|[TimeControl "600...|[Termination "Nor...|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
Input file:
[Event "Rated Classical game"]
[Site "https://lichess.org/j1dkb5dw"]
[White "BFG9k"]
[Black "mamalak"]
[Result "1-0"]
[UTCDate "2012.12.31"]
[UTCTime "23:01:03"]
[WhiteElo "1639"]
[BlackElo "1403"]
[WhiteRatingDiff "+5"]
[BlackRatingDiff "-8"]
[ECO "C00"]
[Opening "French Defense: Normal Variation"]
[TimeControl "600+8"]
[Termination "Normal"]
1. e4 e6 2. d4 b6 3. a3 Bb7 4. Nc3 Nh6 5. Bxh6 gxh6 6. Be2 Qg5 7. Bg4 h5 8. Nf3 Qg6 9. Nh4 Qg5 10. Bxh5 Qxh4 11. Qf3 Kd8 12. Qxf7 Nc6 13. Qe8# 1-0
[Event "Rated Classical game"]
.
.
.
Each game starts with [Event, so I feel like it should be doable as the file has a repeating structure; alas, I can't get it to work.
Extra points:
I don't actually need the move list, so if it's easier it can be dropped.
I only want the content inside the " " on each line once it has been converted to a Spark DataFrame.
Many thanks.
wholeTextFiles reads each file into a single record. If you read only one file, the result will be an RDD with only one row, containing the whole text file. The regexp logic in the question returns only one result per row, and this will be the first entry in the file.
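To see this behaviour, here is a minimal sketch (untested, assuming the same single input file and an active SparkSession named spark):
# wholeTextFiles returns an RDD of (path, content) pairs, one pair per file,
# so a single input file becomes a single record no matter how many games it holds
rdd = spark.sparkContext.wholeTextFiles("example copy 2.txt")
print(rdd.count())            # 1 - the whole file is one record
path, content = rdd.first()   # content is the full text, all games concatenated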
Probably the best solution would be to split the file at the OS level into one file per game (for example here) so that Spark can read the multiple games in parallel. But if a single file is not too big, splitting the games can also be done within PySpark:
Read the file(s):
basefile = spark.sparkContext.wholeTextFiles(<....>).toDF()
Create a list of columns and convert this list into a list of column expressions using regexp_extract:
from pyspark.sql import functions as F
cols = ['Event', 'White', 'Black', 'Result', 'UTCDate', 'UTCTime', 'WhiteElo', 'BlackElo', 'WhiteRatingDiff', 'BlackRatingDiff', 'ECO', 'Opening', 'TimeControl', 'Termination']
cols = [F.regexp_extract('game', rf'{col} \"(.*)\"',1).alias(col) for col in cols]
Extract the data:
split the whole file into an array of games
explode this array into single records
delete the line breaks within each record so that the regular expression works
use the column expressions defined above to extract the data
basefile.selectExpr("split(_2,'\\\\[Event ') as game") \
.selectExpr("explode(game) as game") \
.withColumn("game", F.expr("concat('Event ', replace(game, '\\\\n', ''))")) \
.select(cols) \
.show(truncate=False)
Output (for an input file containing three copies of the game):
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Event |White|Black |Result|UTCDate |UTCTime |WhiteElo|BlackElo|WhiteRatingDiff|BlackRatingDiff|ECO|Opening |TimeControl|Termination|
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Rated Classical game |BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
|Rated Classical game2|BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
|Rated Classical game3|BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+

Looping through columns to perform analysis on each

I have over 338 columns, with a different drug name in each column. What I want to do is loop through all the columns using the code below. This code is for one specific drug; the problem is that I have 338 different drug names. The code is:
NPTESTS
/INDEPENDENT TEST (PLIN3 CYFIP2 IL2RA HSD3B1 IL2RB PYROXD1 ZBED4 MCTP1 LAMA3 CTSC EDEM1 LIF PIM3
PPARA SLC6A11 THNSL2 ZNF697) GROUP (drug_1) MANN_WHITNEY KRUSKAL_WALLIS(COMPARE=PAIRWISE)
/MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
/CRITERIA ALPHA=0.05 CILEVEL=95
Is there any way I can loop through the columns and perform the test without running the code block over and over again?
One thing you can try is restructuring the file to long format, so all the drugs are in one column. Then you can run the test on all the drugs at once by splitting the file:
varstocases /make drugval from drug_1 to drug_338/index=drugname(drugval).
sort cases by drugname.
split file by drugname.
*now your code .
NPTESTS
/INDEPENDENT TEST (PLIN3 CYFIP2 IL2RA HSD3B1 IL2RB PYROXD1 ZBED4 MCTP1 LAMA3 CTSC EDEM1 LIF PIM3
PPARA SLC6A11 THNSL2 ZNF697) GROUP (drugval) MANN_WHITNEY KRUSKAL_WALLIS(COMPARE=PAIRWISE)
/MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
/CRITERIA ALPHA=0.05 CILEVEL=95.
split file off.
Alternatively, you can use an SPSS macro to loop through all the drugs and test them one by one:
define testdrugs ()
!do !drg=1 !to 338
NPTESTS
/INDEPENDENT TEST (PLIN3 CYFIP2 IL2RA HSD3B1 IL2RB PYROXD1 ZBED4 MCTP1 LAMA3 CTSC EDEM1 LIF PIM3
PPARA SLC6A11 THNSL2 ZNF697) GROUP !concat("(drug_", !drg, ")") MANN_WHITNEY KRUSKAL_WALLIS(COMPARE=PAIRWISE)
/MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
/CRITERIA ALPHA=0.05 CILEVEL=95
!doend
!enddefine.
* the macro is defined, now we can call it.
testdrugs .

kdb/q: How to apply a string manipulation function to a vector of strings to output a vector of strings?

Thanks in advance for the help. I am new to kdb/q, coming from a Python and C++ background.
Just a simple syntax question: I have a string with fields and their corresponding values
pp_str: "field_1:abc field_2:xyz field_3:kdb"
I wrote an atomic (scalar) function to extract the value of a given field.
get_field_value: {[field; pp_str]
  pp_fields: " " vs pp_str;
  pid_field: pp_fields[where like[pp_fields; field,":*"]];
  start_i: (pid_field[0] ss ":")[0] + 1;
  end_i: count pid_field[0];
  indices: start_i + til (end_i - start_i);
  pid_field[0][indices]}
show get_field_value["field_1"; pp_str]
"abc"
show get_field_value["field_3"; pp_str]
"kdb"
Now how do I generalize this so that if I input a vector of fields, I get a vector of values? I want to input ("field_1"; "field_2"; "field_3") and output ("abc"; "xyz"; "kdb"). I tried multiple approaches (below) but I just don't understand kdb/q's syntax well enough to vectorize my function:
/ Attempt 1 - Fail
get_field_value[enlist ("field_1"; "field_2"); pp_str]
/ Attempt 2 - Fail
get_field_value[; pp_str] /. enlist ("field_1"; "field_3")
/ Attempt 3 - Fail
fields: ("field_1"; "field_2")
get_field_value[fields; pp_str]
To run your function for each field, you could project it on the pp_str argument and use each over the fields:
q)get_field_value[;pp_str]each("field_1";"field_3")
"abc"
"kdb"
Kdb actually has built-in functionality to handle this: https://code.kx.com/q/ref/file-text/#key-value-pairs
q){#[;x](!/)"S: "0:y}[`field_1;pp_str]
"abc"
q)
q){#[;x](!/)"S: "0:y}[`field_1`field_3;pp_str]
"abc"
"kdb"
I think this might be the syntax you're looking for.
q)get_field_value[; pp_str]each("field_1";"field_2")
"abc"
"xyz"

pandas groupby trying to optimise several steps

I've been trying to optimise a bokeh server to calculate live stats by selected country on Covid19.
I found myself repeating a groupby call to calculate new columns and was wondering, having created the groupby, if I could then apply it, in a way similar to .agg(), on multiple columns?
For example:
dfall = pd.DataFrame(db("SELECT * FROM C19daily"))
dfall.set_index(['geoId', 'date'], drop=False, inplace=True)
dfall = dfall.sort_index(ascending=True)
dfall.head()
id date geoId cases deaths auid
geoId date
AD 2020-03-03 70119 2020-03-03 AD 1 0 AD03/03/2020
2020-03-14 70118 2020-03-14 AD 1 0 AD14/03/2020
2020-03-16 70117 2020-03-16 AD 3 0 AD16/03/2020
2020-03-17 70116 2020-03-17 AD 9 0 AD17/03/2020
2020-03-18 70115 2020-03-18 AD 0 0 AD18/03/2020
I need to create new columns based on 'cases' and 'deaths' and applying various functions like cumsum(). Currently I do this the long way
dfall['ccases'] = dfall.groupby(level=0)['cases'].cumsum()
dfall['dpc_cases'] = dfall.groupby(level=0)['cases'].pct_change(fill_method='pad', periods=7)
.....
dfall['cdeaths'] = dfall.groupby(level=0)['deaths'].cumsum()
dfall['dpc_deaths'] = dfall.groupby(level=0)['deaths'].pct_change(fill_method='pad', periods=7)
I tried to optimise the groupby call like this:-
with dfall.groupby(level=0) as gr:
gr = g['cases'].cumsum()...
But the error suggests the class doesn't support this:
AttributeError: __enter__
I thought I could use .agg({}) and supply a dictionary:
g = dfall.groupby(level=0).agg({'cc' : 'cumsum', 'cd' : 'cumsum'})
but that produces another error
pandas.core.base.SpecificationError: nested renamer is not supported
I have plenty of other bits to optimise; I thought this Python part would be the easiest and would save a few ms!
Could anyone nudge me in the right direction?
To avoid repeating dfall.groupby(level=0) you can just save it in a variable:
gb = dfall.groupby(level=0)
gb_cases = gb['cases']
dfall['ccases'] = gb_cases.cumsum()
dfall['dpc_cases'] = gb_cases.pct_change(fill_method='pad', periods=7)
...
And to run multiple aggregations using a single expression, I think you can use named aggregation. But I have no clue whether it will be more performant or not. Either way, it's better to profile the code and improve the actual bottlenecks.
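As an untested sketch (the names summary, total_cases and total_deaths below are just placeholders, and named aggregation needs a reasonably recent pandas): named aggregation suits reducing functions such as sum, while per-row transforms like cumsum and pct_change can still reuse the saved groupby in a single assign call:
gb = dfall.groupby(level=0)
# Named aggregation: one result row per geoId, keyword names become column names
summary = gb.agg(total_cases=('cases', 'sum'), total_deaths=('deaths', 'sum'))
# Transform-style columns (same length as dfall) built in one expression by
# reusing the saved groupby object; the Series align on the existing index
dfall = dfall.assign(
    ccases=gb['cases'].cumsum(),
    dpc_cases=gb['cases'].pct_change(fill_method='pad', periods=7),
    cdeaths=gb['deaths'].cumsum(),
    dpc_deaths=gb['deaths'].pct_change(fill_method='pad', periods=7),
)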

Foreach message in logstash

I need help: I want to compare two or more messages containing kv (key-value) data in Logstash.
examples :
first message : X < 10=5.4|9=14|36=V|3=9|49=360T_SEP|5=Good|220=p48
second messages : y1 > 8=pap4|10=495|37=d|34=7|49=SEP|220=p48
y2 > 8=pap4|10=495|34=d|34=7|49=SEP|220=p48
Iteration 1: from x I get two keys, 5 and 220.
Iteration 2: I check whether y1 is missing key 5; if the 220 from x equals the 220 in y1, then I set key 5 in y1.
Basically, for each message I want to retrieve the key 220 that corresponds to key 5.
Any suggestion, please?
Unless things have really changed, Logstash typically concerns itself with one event at a time. The elapsed filter is one of the only exceptions, where it considers prior events during processing.
You could use the ruby filter to create your own cache, or perhaps use the redis input and output plugins to that effect, but I'd suggest changing the format of the original message to include the data you need.
