Can I compare a key in both a Db2 table and a file on the mainframe?

I have a student ID as the primary key in both a Db2 table and a file that consists of various records on the mainframe. If the key matches in both the Db2 table and the file, then I have to move that record into output file-1. How can I do this?

I second the comments expecting the OP to show progress and ask specific questions. At least share input/output examples, failed attempts, and the type and definition of the problem (JCL, programming, Db2, ...). But maybe this is a case of not knowing where to start, which is not that uncommon in the mainframe universe. If that's it, then here's me making a bunch of assumptions :)
Db2 data lives in tables, but you can put the relevant data into a dataset. Use Db2 utilities like UNLOAD, the sample DSNTEP* programs, DSNTIAUL, vendor tools, in-house tools and programs, and so on. This makes it a two-step process in JCL:
{{jobcard}}
//*
//* Get relevant data from Db2 into a dataset
//*
//RUNTEP2 EXEC PGM=IKJEFT01,DYNAMNBR=20
//STEPLIB DD DISP=SHR,DSN={{DB2_HLQ}}.SDSNLOAD
// DD DISP=SHR,DSN={{DB2_HLQ}}.RUNLIB.LOAD
//SYSTSPRT DD SYSOUT=*
//SYSTSIN DD *
DSN SYSTEM({{DB2_SSID}})
RUN PROGRAM(DSNTIAUL) PLAN(DSNTIAUL) PARMS('SQL,10') -
LIB('{{DB2_HLQ}}.RUNLIB.LOAD')
//SYSPRINT DD SYSOUT=*
//SYSUDUMP DD SYSOUT=*
//SYSREC00 DD DSNAME=&&ROWS,DISP=(NEW,PASS),UNIT=SYSDA
//SYSPUNCH DD DUMMY
//SYSIN DD *
Select EMPNO From {{SCHEMA}}.EMP
Where LASTNAME In ('HAAS', 'YOSHIMURA');
/*
//*
//* In case of the sample EMP table, you should end up with 2 EMPNOs in
//* SYSREC00:
//* 000010 <- HAAS
//* 000170 <- YOSHIMURA
//*
//* Use DFSORT to perform a join.
//*
//SORT EXEC PGM=SORT
//SYSOUT DD SYSOUT=*
//SORTJNF1 DD *
000000 spam -
000010 qualifying row 1/2 +
000090 spam -
000170 qualifying row 2/2 +
000900 eggs -
//SORTJNF2 DD DSNAME=&&ROWS,DISP=SHR
//SORTOUT DD SYSOUT=*
//SYSIN DD *
* Control statements for JOINKEYS application
JOINKEYS FILE=F1,FIELDS=(1,6,A)
JOINKEYS FILE=F2,FIELDS=(1,6,A)
REFORMAT FIELDS=(F1:1,70)
* Control statements for main task (joined records)
SORT FIELDS=COPY
/*
Here's a snippet of what you get in SORTOUT:
********************************* TOP OF DATA ***
000010 qualifying row 1/2 +
000170 qualifying row 2/2 +
******************************** BOTTOM OF DATA *
Alternatively, write a script on the mainframe with REXX, Java, Python, Node, ..., or a DFSORT exit. Or script off-mainframe with whatever you feel most comfortable with that can handle basic database access and FTP/SSH/HTTP (z/OSMF). You can, say, do it all in Jupyter with Python and ibm_db_dbi. I'm not suggesting you do that, but assuming this is a general "how" question, it's one of the ways too.
The script approach does not have to go Db2 -> dataset -> process -> dataset. You could load the file data into Db2 instead, then aggregate, join, and so on with SQL rather than with sorts, sets, intersections, or whatever you'd go for in a script.
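If the scripted route appeals, here is a minimal sketch, assuming Python with the ibm_db_dbi driver; the connection string, table, column, and file names are all made up for illustration:

# Minimal sketch: match keys from a Db2 table against a flat file (assumed names).
import ibm_db_dbi

# Connection details are placeholders - substitute your own.
conn = ibm_db_dbi.connect(
    "DATABASE=DB01;HOSTNAME=myhost;PORT=50000;PROTOCOL=TCPIP;UID=user;PWD=secret",
    "", "")
cur = conn.cursor()
cur.execute("SELECT STUDENT_ID FROM SCHEMA1.STUDENTS")   # hypothetical table/column
db_keys = {str(row[0]).strip() for row in cur.fetchall()}
conn.close()

# input.txt / output1.txt stand in for the mainframe datasets.
with open("input.txt") as infile, open("output1.txt", "w") as outfile:
    for record in infile:
        key = record[0:6].strip()        # assuming the key sits in columns 1-6
        if key in db_keys:
            outfile.write(record)        # key exists in both, so it goes to output file-1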

Related

Parquet data to AWS Redshift slow

I want to insert data from S3 parquet files to Redshift.
The parquet files come from a process that reads JSON files, flattens them out, and stores them as parquet. For this we use pandas dataframes.
To do so, I tried two different things. The first one:
COPY schema.table
FROM 's3://parquet/provider/A/2020/11/10/11/'
IAM_ROLE 'arn:aws:iam::XXXX'
FORMAT AS PARQUET;
It returned:
Invalid operation: Spectrum Scan Error
error: Spectrum Scan Error
code: 15001
context: Unmatched number of columns between table and file. Table columns: 54, Data columns: 41
I understand the error, but I don't have an easy way to fix it.
If we have to do a reload from 2 months ago, the file will only have, for example, 40 columns, because at that time we only needed that data, but the table has since grown to 50 columns.
So we need something automatic, or at least a way to specify the columns.
Then I tried another option, which is to do a SELECT with AWS Redshift Spectrum. We know how many columns the table has from the system tables, and we know the structure of the file by loading it again into a pandas dataframe. Then I can combine both to get an identical structure and do the INSERT.
It works fine but it is slow.
The select looks like:
SELECT fields
FROM schema.table
WHERE partition_0 = 'A'
AND partition_1 = '2020'
AND partition_2 = '11'
AND partition_3 = '10'
AND partition_4 = '11';
The partitions are already added as I checked using:
select *
from SVV_EXTERNAL_PARTITIONS
where tablename = 'table'
and schemaname = 'schema'
and values = '["A","2020","11","10","11"]'
limit 1;
I have around 170 files per hour, in both JSON and parquet. The process lists all files in the S3 JSON path, processes them, and stores them in the S3 parquet path.
I don't know how to improve the execution time, as the INSERT from parquet takes 2 minutes for each partition_0 value. I tried the SELECT alone to make sure it's not an INSERT issue, and it takes 1:50 minutes. So the issue is reading data from S3.
If I select a non-existent value for partition_0, it again takes around 2 minutes, so there is some kind of problem accessing the data. I don't know whether the partition_0 naming and the others are treated as Hive partitioning format.
Edit:
AWS Glue Crawler table specification
Edit: Add SVL_S3QUERY_SUMMARY results
step:1
starttime: 2020-12-13 07:13:16.267437
endtime: 2020-12-13 07:13:19.644975
elapsed: 3377538
aborted: 0
external_table_name: S3 Scan schema_table
file_format: Parquet
is_partitioned: t
is_rrscan: f
is_nested: f
s3_scanned_rows: 1132
s3_scanned_bytes: 4131968
s3query_returned_rows: 1132
s3query_returned_bytes: 346923
files: 169
files_max: 34
files_avg: 28
splits: 169
splits_max: 34
splits_avg: 28
total_split_size: 3181587
max_split_size: 30811
avg_split_size: 18825
total_retries: 0
max_retries: 0
max_request_duration: 360496
avg_request_duration: 172371
max_request_parallelism: 10
avg_request_parallelism: 8.4
total_slowdown_count: 0
max_slowdown_count: 0
Add query checks
Query: 37005074 (SELECT in localhost using pycharm)
Query: 37005081 (INSERT in AIRFLOW AWS ECS service)
STL_QUERY shows that both queries take around 2 min
select * from STL_QUERY where query=37005081 OR query=37005074 order by query asc;
Query: 37005074 2020-12-14 07:44:57.164336,2020-12-14 07:46:36.094645,0,0,24
Query: 37005081 2020-12-14 07:45:04.551428,2020-12-14 07:46:44.834257,0,0,3
STL_WLM_QUERY shows no queue time; it is all exec time
select * from STL_WLM_QUERY where query=37005081 OR query=37005074;
Query: 37005074 Queue time 0 Exec time: 98924036 est_peak_mem:0
Query: 37005081 Queue time 0 Exec time: 100279214 est_peak_mem:2097152
SVL_S3QUERY_SUMMARY shows that the query takes 3-4 seconds in S3
select * from SVL_S3QUERY_SUMMARY where query=37005081 OR query=37005074 order by endtime desc;
Query: 37005074 2020-12-14 07:46:33.179352,2020-12-14 07:46:36.091295
Query: 37005081 2020-12-14 07:46:41.869487,2020-12-14 07:46:44.807106
stl_return: comparing the min start to the max end for each query gives 3-4 seconds, as SVL_S3QUERY_SUMMARY says
select * from stl_return where query=37005081 OR query=37005074 order by query asc;
Query:37005074 2020-12-14 07:46:33.175320 2020-12-14 07:46:36.091295
Query:37005081 2020-12-14 07:46:44.817680 2020-12-14 07:46:44.832649
I don't understand why SVL_S3QUERY_SUMMARY shows just 3-4 seconds to run the query in Spectrum, while STL_WLM_QUERY says the execution time is around 2 minutes, which is what I see in my localhost and production environments. Nor do I know how to improve it, because stl_return shows that the query returns little data.
EXPLAIN
XN Partition Loop (cost=0.00..400000022.50 rows=10000000000 width=19608)
-> XN Seq Scan PartitionInfo of parquet.table (cost=0.00..22.50 rows=1 width=0)
Filter: (((partition_0)::text = 'A'::text) AND ((partition_1)::text = '2020'::text) AND ((partition_2)::text = '12'::text) AND ((partition_3)::text = '10'::text) AND ((partition_4)::text = '12'::text))
-> XN S3 Query Scan parquet (cost=0.00..200000000.00 rows=10000000000 width=19608)
" -> S3 Seq Scan parquet.table location:""s3://parquet"" format:PARQUET (cost=0.00..100000000.00 rows=10000000000 width=19608)"
svl_query_report
select * from svl_query_report where query=37005074 order by segment, step, elapsed_time, rows;
Just like in your other question, you need to change the key paths of your objects. It is not enough to just have "A" in the key path; it needs to be "partition_0=A". This is how Spectrum knows whether or not the object is in the partition.
Also, you need to make sure your objects are of a reasonable size, or it will be slow when you need to scan many of them. It takes time to open each object, and if you have many small objects the time to open them can be longer than the time to scan them. This is only an issue if you need to scan a great many files.
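If you want to check both points quickly, here is a small sketch; it assumes Python with boto3 and a made-up bucket/prefix, and simply lists the keys so you can confirm they contain partition_0=... segments and are not a swarm of tiny objects:

# Sketch: inspect S3 keys for Hive-style partition naming and object size.
# Bucket and prefix are assumptions - replace them with your own.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

small, total = 0, 0
for page in paginator.paginate(Bucket="parquet", Prefix="provider/"):
    for obj in page.get("Contents", []):
        total += 1
        if "partition_0=" not in obj["Key"]:
            print("Not Hive-style:", obj["Key"])
        if obj["Size"] < 1_000_000:          # < ~1 MB counts as "small" here
            small += 1

print(f"{small} of {total} objects are smaller than 1 MB")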

Postgresql - IN clause optimization for more than 3000 values

I have an application where the user uploads an Excel file (.xlsx or .csv) with more than 10,000 rows and a single column, "partId", containing the values to look up in the database.
I read the Excel values, store them in a list, and pass the list as a parameter to the Spring Boot JPA repository find method, which builds the IN clause query internally:
// Read excel file
List<String> vinList = new ArrayList<>();
InputStream stream = new ByteArrayInputStream(file.getBytes());
Workbook wb = WorkbookFactory.create(stream);
org.apache.poi.ss.usermodel.Sheet sheet = wb.getSheetAt(wb.getActiveSheetIndex());
Iterator<Row> rowIterator = sheet.rowIterator();
while (rowIterator.hasNext()) {
    Row row = rowIterator.next();
    Cell cell = row.getCell(0);
    System.out.println(cell.getStringCellValue());
    vinList.add(cell.getStringCellValue());
}
//JPA repository method that I used
findByPartIdInAndSecondaryId(List<String> partIds);
I have read in many articles, and experienced the same in the case above, that an IN query is inefficient for a huge list of values.
How can I optimize the above scenario or write a new, optimized query?
Also, please let me know if there is a more efficient way of reading an Excel file than the code snippet above.
It would be very helpful! Thanks in advance!
If the list is truly huge, you will never be lightning fast.
I see several options:
1. Send a query with a large IN list, as you mention in your question.
2. Construct a statement that is a join with a large VALUES clause:
SELECT ... FROM mytable
JOIN (VALUES (42), (101), (43), ...) AS tmp(col)
ON mytable.id = tmp.col;
3. Create a temporary table with the values and join with that:
BEGIN;
CREATE TEMP TABLE tmp(col bigint) ON COMMIT DROP;
Then either
COPY tmp FROM STDIN; -- if Spring supports COPY
or
INSERT INTO tmp VALUES (42), (101), (43), ...; -- if not
Then
ANALYZE tmp; -- for good statistics
SELECT ... FROM mytable
JOIN tmp ON mytable.id = tmp.col;
COMMIT; -- drops the temporary table
Which of these is fastest is best determined by trial and error for your case; I don't think that it can be said that one of the methods will always beat the others.
Some considerations:
Solutions 1. and 2. may result in very large statements, while solution 3. can be split into smaller chunks.
Solution 3. will very likely be slower unless the list is truly large.
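For what it is worth, here is a minimal sketch of option 3, written in Python with psycopg2 purely for illustration (the question uses Spring Boot JPA, so treat the driver calls as stand-ins); the table and column names are assumed:

# Sketch of option 3: temp table + bulk insert + join (assumed names throughout).
import psycopg2
from psycopg2.extras import execute_values

part_ids = ["P-0001", "P-0002", "P-0003"]    # in reality: the ~10,000 values from the file

conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")
with conn:                                   # one transaction; temp table drops on commit
    with conn.cursor() as cur:
        cur.execute("CREATE TEMP TABLE tmp(col text) ON COMMIT DROP")
        # Bulk-insert the ids; execute_values batches them into a few statements.
        execute_values(cur, "INSERT INTO tmp (col) VALUES %s",
                       [(p,) for p in part_ids])
        cur.execute("ANALYZE tmp")           # good statistics for the planner
        cur.execute("SELECT m.* FROM mytable AS m JOIN tmp ON m.part_id = tmp.col")
        rows = cur.fetchall()
conn.close()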

pandas.read_sql with sqlite is extremly slow

I'm using pandas.read_sql with an SQLite database and it is extremely slow.
I have a table with 800 rows and 49 columns (datatypes just TEXT and REAL) and it takes over 3 minutes to fetch the data from the database into the dataframe.
The DB file and the Python script are running on the same machine and the same filesystem.
Is there any way to speed up pandas.read_sql?
This is the code fragment:
self.logger.info('{} - START read_sql: {}'.format(self.botid, table))
result = pd.read_sql("select * from {}".format(table), self.dbconn,
                     index_col=indexcolname)
self.logger.info('{} - END read_sql: {}'.format(self.botid, table))
I found the solution myself:
The problem was using the connection as an instance attribute: self.dbconn.
If I always open a new connection and close it at the end, performance is absolutely no problem!
conn = self.create_connection(self.db_file)
self.logger.info('{} - START from sql: {}'.format(self.botid, table))
result = pd.read_sql("select * from {}".format(table), conn,
                     index_col=indexcolname)
self.logger.info('{} - END from sql: {}'.format(self.botid, table))
conn.close()
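A slightly tidier variant of the same idea, sketched with sqlite3 and contextlib.closing so the fresh connection is always closed (path, table, and index column are placeholders):

# Sketch: open a fresh SQLite connection per read and always close it.
import sqlite3
from contextlib import closing
import pandas as pd

db_file = "bot.db"            # placeholder path
table = "my_table"            # placeholder table name

with closing(sqlite3.connect(db_file)) as conn:
    result = pd.read_sql("select * from {}".format(table), conn,
                         index_col="id")    # assumed index column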

How can I dynamically pass a value to the Db2 search clause LIKE while fetching results from another table

Can someone help me with how to dynamically pass a value to the Db2 search clause LIKE while fetching results from another table?
I am trying this:
select * from table2 where file_name like '%(select file_name from table1)'
I've even tried CONCAT, using sysibm.sysdummy1 methods, but no luck.
Maybe this helps:
SELECT *
FROM table2
JOIN table1
ON table2.file_name LIKE CONCAT('%',table1.file_name)
Not having been shown the DDL for the files, nor any sample data and expected results from which a reader could determine whether there might be other considerations as implied obstacles, the following variation of the already-offered answer is more liberal in selecting what might be intended by the OP's select * from table2 where file_name like '%(select file_name from table1)'; i.e. rather than an effective ends-with (or starts-with) predicate on the file-name value, the following achieves an effective contains predicate on the file-name value.
select /* t1.file_name, */ t2.*
from table2 as t2
inner join
table1 as t1
on t2.file_name like '%' concat rtrim(t1.file_name) concat '%'

Load connecting tables from Cassandra in QlikView with DataStax ODBC

I am new to both Cassandra (2.0) and QlikView (11).
I have two keyspaces (tables) with a large amount of data in Cassandra and I want to load them into QlikView.
Since I cannot load the entire set, filtering is necessary.
// In QlikView's edit script
ODBC CONNECT TO [DataStax Cassandra ODBC DSN64];
LOAD idsession,
logintime,
"h_id" as hid;
SQL SELECT *
FROM Cassandra.test.sessions
WHERE logintime > '2015-06-09'
ALLOW FILTERING;
LOAD idhost,
site;
SQL SELECT *
FROM Cassandra.test.hosts
WHERE idhost in hid;
The second query does not work; QlikView reports the error line 3:16 no viable alternative at input 'hid'.
My question: is it possible to get the h_ids from the first query and only collect the corresponding entities from the second table?
I assume that you can't do an Exists in the DataStax ODBC, which might otherwise help. DataStax doc
This could be done with an external program (in C#, for example), but I really want to do this in QlikView's script file:
// Not complete code
query = select * from sessions where loginTime > '2015-06-09';
foreach (var id in query) {
query2 = "select * from hosts where idhost = " + i;
}
EDIT
This can be solved when loading Excel files:
TableA:
LOAD id,
itema
FROM
[C:\test1data.xlsx]
(ooxml, embedded labels);
TableB:
LOAD idb,
itemb,
ida
FROM
[C:\test2data.xlsx]
(ooxml, embedded labels) where(Exists (id,ida));
EDIT2
Besides the great answer from @i_saw_drones, another solution is to loop through the ids.
For i = 1 to NoOfRows('Sessions')
Let cur_id = Peek('hid',i - 1,'Sessions');
LOAD
idhost,
site;
SQL SELECT *
FROM Cassandra.test.hosts
WHERE idhost = $(cur_id);
NEXT i
Nevertheless, the performance was not great. It took about 30 minutes to load around 300 K lines from Cassandra. The same queries were tested in a C# program with the connector and took 9 seconds, but that was just the query; you would then have to write the result to XML and load that into QlikView.
The reason the second query fails is that the WHERE clause expects to find a literal list of values to look "in". For example:
LOAD
idhost,
site;
SQL SELECT *
FROM Cassandra.test.hosts
WHERE idhost in ('ID1', 'ID2', 'ID3', 'ID4');
The hid field returned by the first query is a QlikView list and as such cannot be immediately coerced into a string. We have to do a little more scripting to obtain a list of values from the first query in literal form, and then add that to the second query as part of the WHERE clause. The easiest way to do this is to concatenate all of your hids into a string and then use the string as part of your WHERE IN clause.
ODBC CONNECT TO [DataStax Cassandra ODBC DSN64];
MyData:
LOAD
idsession,
logintime,
"h_id" as hid;
SQL SELECT *
FROM Cassandra.test.sessions
WHERE logintime > '2015-06-09'
ALLOW FILTERING;
hid_entries:
LOAD
chr(39) & hids & chr(39) as hids;
LOAD
concat(hid, chr(39) & ',' & chr(39)) as hids;
LOAD DISTINCT
hid
RESIDENT MyData;
LET hid_values = '(' & peek('hids',0,'hid_entries') & ')';
DROP TABLE hid_entries;
LOAD
idhost,
site;
SQL SELECT *
FROM Cassandra.test.hosts
WHERE idhost in $(hid_values);
