Data extraction from mainframe to Excel

How do I extract data from the mainframe into Excel? Currently I am fetching data from MS Access, but the requirement is for the mainframe.
Thanks in advance

First, please understand that saying "extract data from mainframe" is similar to saying "extract data from Intel." The following is not comprehensive but is intended to provide an idea of how to ask your question in a manner which can be meaningfully answered.
Please understand there is a big difference between...
what is technically possible
what is allowed in your shop
what is likely to provide a robust and maintainable solution given your requirements
These are three very different things. Some of us answering questions here on Stack Overflow have life experiences that make us reticent about answering questions regarding what is technically possible absent any mention of what is allowed in your shop or what the actual business requirement is that is being solved.
Mainframes have been around for over half a century, and many shops have standard solutions to technical problems. Sometimes the solution is "don't do that, and here's what we do instead." Working against the recommendations of your technical staff, or your shop standards, is career limiting.
What operating system?
z/OS is in common use on mainframes, but there do exist shops that still run one of its ancestors like MVS/XA. The mainframe operating system traces its roots back to OS/360 first available in 1965.
z/TPF
z/Linux usually runs on top of the z/VM hypervisor.
z/VSE
In what sort of file does the data reside?
QSAM or Queued Sequential Access Method, also commonly called flat files.
VSAM or Virtual Storage Access Method. There are several different kinds of VSAM files, including KSDS (Key Sequenced Data Set), ESDS (Entry Sequenced Data Set), RRDS (Relative Record Data Set) and Linear (conceptually similar to a memory mapped file).
a DBMS like DB2 or IMS. A DBMS typically has extract facilities to allow writing a flat file from its own internal format. DB2, for example, stores data in Linear VSAM datasets.
Unix System Services files reside in a different file system than QSAM or VSAM. This will be more familiar, as it has a directory structure where the classic z/OS file system has none.
What does the data look like?
You must know the record layout of the data you wish to retrieve.
It is common for mainframe data to include both text and binary data in a single record, for example a name and a currency amount:
Hopper    Grace     ar%
...which would be...
x'C8969797859940404040C799818385404040404081996C'
...in hex. This is code page 37, commonly referred to as EBCDIC.
Without knowing that the family name is confined to the first 10 bytes, the given name to the subsequent 10 bytes, and the currency amount is in packed decimal (also known as binary coded decimal) in the next 3 bytes, you cannot accurately transfer the data, because code page conversion will destroy the currency amount, which is +819.96. Converting to code page 1250, commonly in use on Microsoft Windows, you would end up with...
x'486F707065722020202047726163652020202020617225'
...where the text data is translated but the packed data is destroyed. The packed data no longer has a valid sign in the last nibble (the lower half of the last byte) and the amount itself has been changed.
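To make this concrete, here is a minimal Python sketch using the record above. The offsets and the two implied decimal places follow the layout just described, and cp037 is the EBCDIC code page mentioned; everything else is illustrative only.

```python
# Minimal sketch: the byte string and field offsets come from the example above;
# cp037 is the EBCDIC code page, and two implied decimal places are per the layout.
record = bytes.fromhex("C8969797859940404040C799818385404040404081996C")

# Naive whole-record code page conversion: the text survives, the packed amount does not.
print(record.decode("cp037"))            # 'Hopper    Grace     ar%'

# Layout-aware decode: 10 bytes family name, 10 bytes given name, 3 bytes packed decimal.
family = record[0:10].decode("cp037").rstrip()
given  = record[10:20].decode("cp037").rstrip()
packed = record[20:23].hex()             # '81996c'
digits, sign = packed[:-1], packed[-1]   # sign nibble: c or f = positive, d = negative
amount = int(digits) / 100
if sign == "d":
    amount = -amount
print(family, given, amount)             # Hopper Grace 819.96
```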
Security
Is the data you wish to access covered by privacy legislation? You may have to provide some evidence that whatever protections are in place to guarantee that only authorized personnel have access to this data on the mainframe are also in place once you have transferred it off of the mainframe. Such guarantees may have to satisfy an auditor.
What you need
You need to know what operating system holds your data, you need to know what type of file holds your data (a DBMS isn't a type of file but let's let that go for now), and you need to know your record layout(s).
Typically, the easy way to retrieve data is to extract it from its existing data store (QSAM, VSAM, DBMS) into a flat file where all the data is in a text format. There are mainframe utilities to accomplish this. In extreme cases, a program can be written to accomplish this goal. Once it has been accomplished, you can transfer your data without fear of destroying packed or binary data.
You may be able to read data directly from a DBMS if that's where your data resides, but this may depend on shop standards, including security.
Modern mainframes can transfer data via FTP, FTPS, and SFTP. Which is recommended in your shop is something to ask your technical staff.
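As a purely hypothetical illustration of pulling a text-only extract down via FTP (your shop may well mandate FTPS or SFTP instead), here is a sketch using Python's ftplib. The host, credentials and dataset name are placeholders, and the quoted-dataset-name convention is an assumption about your site's FTP server configuration.

```python
# Hypothetical sketch only: host, credentials and dataset name are placeholders,
# and your shop may require FTPS or SFTP rather than plain FTP.
from ftplib import FTP

ftp = FTP("mainframe.example.com")
ftp.login("USERID", "password")
lines = []
# z/OS FTP servers generally address datasets by quoted name (an assumption about
# your site). retrlines transfers in ASCII mode, so EBCDIC text is translated.
ftp.retrlines("RETR 'USERID.EXTRACT.TEXT'", lines.append)
ftp.quit()

with open("extract.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(lines) + "\n")
```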

Related

I'm facing an issue coding this rotation part in CICS

How can we rotate CUSTOMER NUMBER values in CICS?
For example, if the customer number is c52063, how can I get the next value, i.e. c52064 (say), in CICS?
This is a very broad question; essentially you're asking what persistence mechanisms are available in CICS.
Please understand there is a big difference between...
what is technically possible
what is allowed in your shop
what is likely to provide a robust and maintainable solution given your requirements
These are three very different things. Some of us answering questions here on Stack Overflow have life experiences that make us reticent about answering questions regarding what is technically possible absent any mention of what is allowed in your shop or what the actual business requirement is that is being solved.
Mainframes have been around for over half a century, and many shops have standard solutions to technical problems. Sometimes the solution is "don't do that, and here's what we do instead." Working against the recommendations of your technical staff, or your shop standards, is career limiting.
A couple of options, not intended to be an exhaustive list...
SELECT and UPDATE the value in a DBMS (such as DB2). You must code your SELECT SQL with FOR UPDATE.
READ and REWRITE the value in a VSAM file. You must code your READ with the UPDATE option.
In either case you are holding a lock on the resource until you hit either an explicit (EXEC CICS SYNCPOINT) or implicit (end of transaction) syncpoint or a rollback (EXEC CICS SYNCPOINT ROLLBACK or an abend condition). Holding such a lock means all other instances of your transaction will wait until the syncpoint or rollback has occurred. (A sketch of this read-lock-update pattern appears after the list of options below.)
If you know for certain your application will be limited to a single CICS region, other options would include having a transaction, initiated as part of region initialization processing, that obtains and populates a shared resource such as a temporary storage queue (with a name known to your application) with the last known customer number. This initialization transaction would have to obtain the highest used customer number from somewhere, probably a DBMS or a VSAM file. Applications would have to be coded to ENQ and DEQ their access to the temporary storage queue. You could do this without a temporary storage queue, using shared memory instead and storing the address of that memory in the CICS CWA for your region. Again, ENQ and DEQ logic would have to be coded in the applications.
You could use a named counter as defined by your CICS Systems Programmer. Be certain to read and understand the recovery requirements for your application as documented in the IBM Knowledge Center.
Again, this is not an exhaustive list, it is just to give an overview of some of the options available. Talk to your technical staff, they likely have either a standard solution as employed by your shop or a preference based on their experience and your requirements.
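As a hedged illustration of the first option, here is the read-lock-increment pattern expressed through a generic Python DB-API connection. The table and column names (CUSTOMER_COUNTER, LAST_CUSTNO) are hypothetical, the parameter marker style depends on your driver, and in CICS the same idea would be coded in your application language with the unit of work ended by a syncpoint rather than a commit call.

```python
# Hedged sketch of the read-lock-increment pattern via a generic DB-API connection.
# Table/column names are hypothetical; parameter marker style (?) depends on the driver.
def next_customer_number(conn):
    cur = conn.cursor()
    # FOR UPDATE holds a lock on the row until commit/rollback, so concurrent
    # transactions serialize here instead of handing out duplicate numbers.
    cur.execute("SELECT LAST_CUSTNO FROM CUSTOMER_COUNTER FOR UPDATE")
    (last,) = cur.fetchone()
    cur.execute("UPDATE CUSTOMER_COUNTER SET LAST_CUSTNO = ?", (last + 1,))
    conn.commit()   # plays the role of the CICS syncpoint: the lock is released
    return last + 1
```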

Adding records to VSAM DATASET [closed]

I have some confusion regarding VSAM as I am new to it. Do correct me where I am wrong and answer my queries.
A cluster contains control areas, and a control area contains control intervals. One control interval contains one dataset. Now, when defining a cluster, we mention a data component and an index component. The name we give to the data component creates a dataset, and the name of the index generates a key. My queries are as follows:
1) If I have to add a new record to that dataset, what is the procedure?
2) What is the procedure for creating a new dataset in a control area?
3) How do I access a dataset and a particular record after they are created?
I tried to find a simple code example but was unable to, so kindly explain with a simple example.
One thing that is going to help you is the IBM Redbook VSAM Demystified: http://www.redbooks.ibm.com/abstracts/sg246105.html which, these days, you can even get on your smartphone, amongst several other ways.
However, your current understanding is a bit astray so you'll need to drop all of that understanding first.
There are three main types of VSAM file, and you'll probably only come across two of those as a beginner: KSDS and ESDS.
KSDS is a Key Sequenced Data Set (an indexed file) and ESDS is an Entry Sequenced Data Set (a sequential file but not a "flat" file).
When you write a COBOL program, there is little difference between using an ESDS and a flat/PS/QSAM file, and not even that much difference when using a KSDS.
Rather than providing an example, I'll refer you to the chapter in the Enterprise COBOL Programming Guide for your release of COBOL. It is Chapter 10 you want, up to and including the section on handling errors. The publication can be found here: http://www-01.ibm.com/support/docview.wss?uid=swg27036733. You can also use the Language Reference for the details of what you can use with VSAM once you have a better understanding of what it is to COBOL.
As a beginning programmer, you don't have to worry about what the structure of a VSAM dataset is. However, you've had some exposure to the topic, and taken a wrong turn.
VSAM datasets themselves can only exist on disk (what we often refer to as DASD). They can be backed-up to non-DASD, but are only directly usable on DASD.
They consist of Control Areas (CAs), which you can regard as just being a lump of DASD; almost exclusively that lump of DASD will be one Cylinder (30 tracks on a 3390, which these days is very likely an emulated 3390). You won't need to know much more about CAs. CAs are more of a conceptual thing than an actual physical thing.
Control Intervals (CI) are where any data (including index data) is. CIs live in CAs.
Records, the things you will have in the FILE SECTION under an FD in a COBOL program, will live in CIs.
Your COBOL program needs to know nothing about the structure of a VSAM dataset. COBOL uses VSAM itself to do all the physical file access; as far as your COBOL program is concerned it is an "indexed" file with a little bit on the SELECT statement to say that it is a VSAM file. Or it is a sequential file with a little... you know by now.

Crowdsourcing reliability measurements - spam/fraud detection

I'd like to collect some kind of geographical information from website users: for a given set of data they will mark a checkbox indicating whether a place has or does not have a given property. Are there any tools/frameworks for detecting fraud or spam submissions based on the whole collected data set (and possibly other info)? I'd like to get filtered, more reliable data.
Not sure if that's exactly what you're asking for, but here are some tips from my experience using Amazon Turk:
There are several academic papers dealing with such problems. Here is a good one.
In addition, based on the following general recommendations, I've created a custom procedure which worked on my data:
a. Include an open question, and filter out cases where it wasn't answered. It's harder to answer such a question automatically, and it might also be more time-consuming, thus less attractive, for a fraudster.
b. If possible, don't use a binary scale (i.e. a checkbox), but some grade (e.g. 1-4 or 1-6). This would give you more data to work with.
c. If available, filter out cases where the time spent in filling your form was too short. (especially useful if you include that open question)
d. If you have multiplicity of inputs per user, check for repetitive answers, and for users which consistently give far-from-average answers.
If each user submits only a single "form", consider putting more than a single element/question in it, so you'll get multiple submissions per-user.
e. If you have only a single submission per user or user-id, your options are more limited. I can suggest filtering out outliers (e.g. data points farther than 3 standard deviations from the average), in case you have enough data.
f. After all the filtering, check the agreement or disagreement in your data (e.g. by checking what proportion of your data points fall within x standard deviations from the average). In case of agreement, use the average; in case of disagreement, collect some more data. (A small sketch of (e) and (f) follows below.)
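As a rough illustration of points (e) and (f), here is a minimal Python sketch. The 3-sigma outlier cut-off, the one-standard-deviation window and the 0.8 agreement threshold are arbitrary knobs chosen for the example, not recommendations.

```python
# Rough sketch of points (e) and (f); all thresholds are arbitrary example values.
from statistics import mean, stdev

def aggregate(scores, outlier_sigmas=3.0, window_sigmas=1.0, agreement=0.8):
    m, s = mean(scores), stdev(scores)                                # assumes several responses
    kept = [v for v in scores if abs(v - m) <= outlier_sigmas * s]    # (e) drop outliers
    km, ks = mean(kept), stdev(kept)
    share = sum(abs(v - km) <= window_sigmas * ks for v in kept) / len(kept)   # (f) agreement
    return km if share >= agreement else None                        # None => collect more data
```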
Hope it helps,

Integrating with 500+ applications

Our customers use 500+ applications, and we would like to integrate these applications with ours. What is the best way to do that? These applications are time registration applications, and common to most of them is that they can export to CSV or similar; some of them are actually home-brewed Excel sheets where time is registered.
The best idea so far is to create our own Excel sheet, which can be used to integrate with all these applications. The integrations could be in the form of cells containing something like ='[c:\export.csv]rawdata'!$A$3, where export.csv is the CSV file exported from the time registration applications. Can you see a better way to integrate with all these applications? It should be mentioned that almost all our customers have Microsoft Office.
Edit: Answers to the excellent questions from Pontus Gagge:
How similar are the data in the different applications?
I assume that since they are time registration applications, they will have some similarities, but I assume that some will register how long one has worked in total for a whole month, while others will specify it for each day. If Excel is chosen, I believe that many of the differences could be ironed out using basic formulas.
What quality is the data?
The quality of the data can vary, so basic validation must be undertaken. A good approach is also to make it transparent to the customers how our application understands their input, so that they are responsible for it.
How large amounts of data are you talking about?
There will be information about the time worked for up to 50 employees.
Is the integration one-way only?
Yes
With what frequency should information be transferred?
Once per month (when they need to pay salaries).
How often do the applications themselves change, and how often does your product change?
If their application is a home-brewed Excel sheet, then I assume it will change about once a year (due, for example, to someone's mistake). If it is a proper standard time registration application, then I do not believe they are updated more often than every fifth year or so, as it is a very stable concept.
Should the integration be fully automatic or can your end users trigger a data transfer?
They can surely trigger data transfer. The users are often dedicated to the process, so they can be trained to do it, which means that they could make up to, say, 30 mouse clicks in order to integrate each month.
Will the customers have somebody to monitor the integrations?
As we have many customers, many of them should be able to undertake the integration themselves. We will, though, be able to assist them over the telephone. We cannot undertake the integration ourselves, though, because we would then be responsible for any errors due to user mistakes, etc.
Does the phrase 'integration spaghetti' mean anything to you...?
I am looking for ideas from the best chefs to cook a nice large portion of that.
You need to come up with a common data format, and a way to translate the individual data formats to the common format. There's really no way around this - any solution you come up with will have to do this in one way or the other. It's the essential complexity of what you're doing.
The bigger issue is actually variances within the source data, in terms of how things like dates are stored, missing columns, etc. Doing a generic conversion for CSV to move columns around is comparatively easy.
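As an illustration of the common-format idea, here is a minimal Python sketch. The common schema (employee_id, date, hours) and the two per-source mappings, including all of their column names, are invented for the example; each real source would get its own small translator.

```python
# Sketch of "common format + one translator per source". The schema and the two
# example mappings (and their column names) are invented placeholders.
import csv
from datetime import datetime

SOURCES = {
    "vendor_a": {
        "employee_id": lambda row: row["EmpNo"],
        "date":        lambda row: datetime.strptime(row["Dato"], "%d.%m.%Y").date().isoformat(),
        "hours":       lambda row: float(row["Timer"].replace(",", ".")),
    },
    "vendor_b": {
        "employee_id": lambda row: row["id"],
        "date":        lambda row: row["work_date"],     # already ISO formatted
        "hours":       lambda row: float(row["minutes"]) / 60,
    },
}

def to_common(source_name, csv_path):
    """Yield rows in the common {employee_id, date, hours} format for one source."""
    translator = SOURCES[source_name]
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {field: convert(row) for field, convert in translator.items()}
```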
I would also look at CSV and then use an OLEDB connection against the CSV file for importing.
If you try to make something that can interface to any data structure in the universe (and 500 is plenty close enough), it is guaranteed to be a maintenance nightmare. Instead I would approach this from multiple angles:
Devise an interface into which a human can enter this data already in the proper format. With 500+ clients, I'd make this a small, raw but functional browser-based site that users can use to enter this information manually. This is the fall-back. At the end of the day, a human can re-key the information into the site and solve the import issue. Ideally, everyone would use this instead of their own format. Data entry people are cheap.
Similar to above, but expanded, I would develop a standard application or standardize on an off-the-shelf application that can be used to replace their existing format. This might take more time than #1. The goal would be to only do one-time imports of these varying data schemas into the application and be done with them for good.
The nice thing about spreadsheets is that you can do anything anywhere. The bad thing about spreadsheets is that you can do anything anywhere. With CSV or a spreadsheet there is simply no way to enforce data integrity and thus consistency (which is the primary goal) on the data. If the source data is already in a database, then that is obviously simpler.
I would be inclined to use a database format into which each of these files needs to be converted, rather than a spreadsheet (e.g. use something like Jet (MDB)). If you have non-Windows users then that will make it harder and you might have to use a spreadsheet. The problem is that it is too easy for the user to change their source structure, break their upload and come crying to you. If a given end user has a resident expert, they can find a way of importing the data into that database format. If you are that expert, then I would, on a case-by-case basis, write something that would import into that database format. XML would be the other choice, but that will likely take more coding than an import/export into a database format.
Standardization of the apps (even having all the sources in a database format instead of a spreadsheet would help) and control over the data schema is the ultimate goal rather than permitting a gazillion formats. There really is no nice answer other than standardization. Otherwise, you are having to write a converter for every Tom-Dick-and-Harry format and again when someone changes the source format.
With a multitude of data sources, mapping each one correctly to an intermediate format is not trivial. Regular expressions are good with a finite set of known data formats. Multiple passes can help when data is ambiguous without context (e.g. month and day fields when you have several days of data), and also help defeat data entry errors. But since this data is connected to salaries, the transfer needs to be good and reliable.
An import configuring trick
Get the customer to make a set of training data in their application. It should have a "predefined unique date", and each subsequent data field should contain a number corresponding to the target data field in your application. On importing, your application needs to recognise the predefined date, determine the unique translation required, effect the displaying/saving of this "mapping key", and stop the import. E.g. if you expect "Duration hours" in field two, then get the user to enter 2 in the relevant field, which might be "Attendance hours".
On subsequent runs, with the mapping definition key available, import becomes a fairly easy process of translation (a small sketch follows the notes below).
Note on terms
"predefined date" - must be historical, say founding date of your company?, might need to be in PC clock settable range.
"mapping key" - could be string of hex digits and nybble based so tractable to workout
The entered code can be extended to signify required conversions, e.g. the customer's application has durations in days and your application expects them in hours.
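Here is a hedged Python sketch of the training-row trick. The predefined date, the target field names and the CSV layout are all placeholders that would be agreed with each customer; conversion codes are left out for brevity.

```python
# Hedged sketch of the training-row trick; all names and values are placeholders.
import csv

PREDEFINED_DATE = "1901-01-01"   # a "historical" date no real record will carry
TARGET_FIELDS = [None, "attendance_hours", "overtime_hours", "employee_id"]  # codes are 1-based

def learn_mapping(training_row):
    """Return {source column index: target field name} from the training row."""
    mapping = {}
    for col, value in enumerate(training_row):
        if value == PREDEFINED_DATE:
            continue                                   # the marker column itself
        if value.isdigit() and 0 < int(value) < len(TARGET_FIELDS):
            mapping[col] = TARGET_FIELDS[int(value)]
    return mapping

def import_rows(csv_path):
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    training = next(r for r in rows if PREDEFINED_DATE in r)
    mapping = learn_mapping(training)                  # the "mapping key"
    for row in rows:
        if row is not training:
            yield {field: row[col] for col, field in mapping.items()}
```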
Interfacing with Windows programs (in order of increasing fragility):
Ye Olde saving as CSV file
Print to an operating system printer that is set up as a text file/PDF, then scavenge the data out of that
Extract data via the application's interface control, typically ActiveX for several Windows programs, e.g. Matlab's Spreadsheet Link
Read the native file format (xls), e.g. with Matlab's xlsread (see the sketch after this list)
Add an additional intermediate spreadsheet sheet that has extended cell references, e.g. ='[filename]rawdata'!$A$3
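For completeness, a small Python sketch of the first and fourth options (a CSV export, and reading a native workbook). The file, sheet and cell names are placeholders, and note that openpyxl reads .xlsx rather than the old binary .xls format.

```python
# Placeholders throughout; openpyxl handles .xlsx, not the old binary .xls.
import csv
from openpyxl import load_workbook

with open("export.csv", newline="", encoding="utf-8") as f:
    raw_rows = list(csv.reader(f))                     # "Ye Olde saving as CSV file"

wb = load_workbook("timesheet.xlsx", data_only=True)   # cached values, not formulas
cell_value = wb["rawdata"]["A3"].value                 # cf. ='[...]rawdata'!$A$3
```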
Have a look at Teiid by JBoss: http://jboss.org/teiid
Also consider using SOA - e.g., if you're on Java, try JBoss SOA platform: http://www.jboss.com/resources/soa/?intcmp=1004
Use a simple XML format. A non-technical person can easily understand a simple XML format (and could even identify basic problems with XML documents that are not well-formed).
Maybe use a DTD (or even better an XML schema) to do very basic validation, and then supplement this with an XSL stylesheet to do more validation with better error reporting. (An XSL stylesheet simply converts from XML to something else and so can generate readable error messages.)
The advantage of this approach is that web browsers such as Internet Explorer can apply the XSL stylesheets. A customer need only spend at most a day enhancing their applications or writing excel macros to generate the XML data in the format that you specify.
Recent versions of Excel have support for converting spreadsheet data to XML, and can even validate against schemas.
Once the data passes the XSL validation checks, you have validated XML data.
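As a minimal sketch of the schema-validation step, here is the same check done with Python's lxml, purely as an example consumer; schema.xsd and data.xml are placeholders for whatever format you publish to your customers, and the XSL-based reporting described above could follow this step.

```python
# Hedged sketch: schema.xsd and data.xml are placeholders for your published format.
from lxml import etree

schema = etree.XMLSchema(etree.parse("schema.xsd"))
doc = etree.parse("data.xml")

if schema.validate(doc):
    print("XML is valid against the published schema")
else:
    for error in schema.error_log:       # readable messages with line numbers
        print(f"line {error.line}: {error.message}")
```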
If you have heaps of data and heaps of money, you could look at existing data management and cleansing tools:
http://www-01.ibm.com/software/data/infosphere/datastage
http://www-01.ibm.com/software/data/infosphere/qualitystage
But even then, you'll likely need to follow kyoryu's suggestion, assuming you have 500+ data formats. The problem isn't on your side. You need them to standardize their output formats if you have no control over their apps. CSV is likely the easiest. You could even send them an Excel template to help them along.

Production, Test, Developer Environments vs Security

What are current practices for enabling developers to build systems that contain private data? Can anyone point to a "best practices" guide for that sort of thing?
We have a Catch-22 here in that developers need to write applications that go against systems that have data that is considered "private." The IT administration would like for us developers to not have access to the data (ie. provide a schema or data structure, but not data itself) whereas most developers (myself included) would like to have access to the production data since not having a representative dataset can lead to bad assumptions (eg. the format of data) and bugs later on.
Does anyone have any formalized "best practices" for this type of thing? Especially official guidelines from some "BigCo" (e.g. Microsoft, IBM) might help, since they are needed to convince management.
My view of the world may be different, as I'm based in the UK, but for the past 20-odd years, I've worked primarily in the public sector on systems handling sensitive data.
The rules are **completely** cut-and-dried. No production data is allowed on the development estate.
As a fundamental principle, we do not want to be responsible for the loss of sensitive data. The users are perfectly good at that, themselves.
Within the past 12 months, my wife has moved from the same regime to one in the private sector where they allow developers access to production data and she's horrified by it. The legal implications (in the UK, at least) can be severe.
Developers don't **need** access to production data. It's simply laziness. Define and create test data to exercise defined test cases (including edge cases) and don't rely on the random-esque nature of production data.
If you **must** use production data (i.e. you manage to convince someone who doesn't know any better that it's acceptable), ensure the data is anonymised **before** it reaches the development estate.
Often times, a subset of sanitized data will be provided that is representative of the private data, but not the private data itself.
At my company, we started using Red Gate's data generator to generate test data. There is a bit of setup, but you can use the tool to generate very usable test data. Yes, I would prefer to use live production data, but it's not feasible (especially if you need to consider HIPAA). It uses regex for each column and allows you to use look-up tables for related tables.
At MediumCo, we strip proprietary data out of our production data in Test and Dev. It has hurt us a little in the past to not have exactly-representative data, but the clients have asked about this point before, and it's usually not an issue, as the environments are populated with a lot of fake proprietary data.
I don't have any best practices paper or anything. But I would think that if you're developing out of an environment that is as protected as the environment that hosts the data in production, there wouldn't be a lot of argument to be made against it.
That is, if your production database is in a datacenter hosted and controlled and secured by your IT staff, if you have a development database that lives in the exact same scenario and doesn't offer any new ways to access the information - you would be in pretty good shape. As an added token of good will - it might be nice to offer to allow anyone worried about security a chance to do some kind of penetration test to ensure that you're telling the truth about security.
The other side of this, of course, is the analysis of the cost for not using the data: that is, it will lead to buggier code, which will cost $xxxxxx.xx in development time vs. virtually no cost to allow a small subset of your development team access to said data.
To avoid the need to manually sanitise/anonymise data, you could use random text replacement - to replace every alphanumeric character in each text field with a random alphanumeric. This:
keeps the data similar in length, size etc. from the developer's point of view
does not cause problems with character sets
leaves date and number fields untouched, which allows for accurate testing with respect to date ranges and quantities
will satisfy most privacy requirements
If you wanted to go a little further you could run random number-for-number replacement on telephone numbers and zip codes, while using alphanumeric replacement on other text fields.
Having an automated replacement script allows you to get up-to-date data dumps from the live system regularly, so your tests are up-to-date with respect to the size and variability of the data in practice.
It does mean that a small number of operations will not be realistic (e.g. indexing on name fields, which in real life are clustered around common letters) but these should be limited.
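A minimal Python sketch of the replacement idea follows. It preserves letter/digit kind, case, length and punctuation (a slightly stricter variant than "any random alphanumeric"), and which columns are treated as text or as identifying numbers is entirely data-dependent; the field names here are placeholders.

```python
# Minimal sketch of random character replacement; field names are placeholders.
import random
import string

def scramble(value):
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(random.choice(string.digits))            # number-for-number
        elif ch.isalpha():
            pick = random.choice(string.ascii_letters)
            out.append(pick.upper() if ch.isupper() else pick.lower())
        else:
            out.append(ch)                                       # keep punctuation/spacing
    return "".join(out)

def anonymise(row, scrambled_fields=("name", "address", "phone", "zip")):
    """Scramble identifying fields; leave dates and amounts untouched."""
    return {k: scramble(v) if k in scrambled_fields else v for k, v in row.items()}
```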

Resources