zEDC dataset compression using JCL - mainframe

What is zEDC dataset compression? How can we apply compression to our mainframe datasets? What DATACLAS parameter values are used for zEDC compression?
Please help
Thanks,
Sunitha S

DATACLAS names are installation-dependent. As @cshneid said, you'll probably have to talk to your local storage administrator / system programmer. Or, if you have access to ISMF, you can look for data classes that have the Compaction parameter set to ZEDC Required (ZR) or ZEDC Preferred (ZP).
Note that there are several system configuration steps that need to be done before zEDC can actually be used, so again, talking to the local experts is really the best/recommended approach.

I agree with randomScott. You can get in touch with your system administrator to get info on the DATACLAS values available.
One quick way I used to follow: put 'Z' next to the dataset in the ISPF data set list and press Enter, and the PDS will be compressed. Note that this is PDS compression (it reclaims space left by deleted members) rather than zEDC compression, and it works only when no other user is accessing that particular PDS.

Related

Is there a size limit for object addition in the Geode region

We are trying to do a POC to change the way we store content in the Geode region. We operate on sketches (sizes can vary from 1 GB to 30 GB) and currently break them into parcels and store the parcels in the region. We then read these parcels and merge them to create a complete sketch for our processing. We are seeing some inconsistencies in the data due to cache eviction and are trying to come up with an approach for storing the complete object in the region instead of storing the parts.
I was looking at the Geode documentation but could not find a size limit for an entry in a region, so I wanted to reach a broader group in case anyone has done anything similar or has some insights into it.
Thanks for your response in advance.
Best Regards,
Amit
From what I've been investigating, the maximum object size is set at 1 GB; you can have a look at GEODE-478 and commit 1e3f89ddcd for further details. It's worth mentioning, as a side note, that objects that big might cause problems with GC, so you might want to stay away from that.
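If you end up keeping the parcel approach, a minimal sketch of chunking below that 1 GB entry limit could look like the following (plain Python; region_put and region_get are hypothetical stand-ins for whatever client API you use, not actual Geode calls):

    # Split a large serialized sketch into chunks that stay safely below
    # Geode's 1 GB entry limit, storing each chunk under a composite key.
    # `region_put` / `region_get` are hypothetical stand-ins for your client API.
    CHUNK_SIZE = 512 * 1024 * 1024  # 512 MB, well under the 1 GB entry limit

    def store_sketch(sketch_id, data, region_put):
        """Store `data` (bytes) as numbered chunks plus a small manifest entry."""
        chunk_count = 0
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            region_put(f"{sketch_id}:chunk:{chunk_count}", chunk)
            chunk_count += 1
        # The manifest tells readers how many chunks to fetch and reassemble.
        region_put(f"{sketch_id}:manifest", {"chunks": chunk_count, "size": len(data)})

    def load_sketch(sketch_id, region_get):
        """Reassemble the sketch from its chunks using the manifest."""
        manifest = region_get(f"{sketch_id}:manifest")
        parts = [region_get(f"{sketch_id}:chunk:{i}") for i in range(manifest["chunks"])]
        return b"".join(parts)

The manifest entry also gives you a single place to check for completeness, which can help with the consistency issues you mentioned around eviction.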
Cheers.

Documentation for Spark options

This one is probably easy to answer, but for the life of me I can't seem to find this.
Can someone please point me to documentation for the various key-value pair options that you can use with Spark?
Example of such an option (in pyspark):
some_spark_table.write.format("parquet").option("parquet.block.size", 1234)
So if I'm interested in what the unit is for the parquet.block.size option, where do I find that?
I found this link which helpfully states: "To find more detailed information about the extra ORC/Parquet options, visit the official Apache ORC/Parquet websites." But I still can't find it.
As the doc says, you can visit the official Apache Parquet website. I think by "official website" they mean the Parquet git repo :)
Citing from there:
Property: parquet.block.size
Description: The block size in bytes. This property depends on the file system:
If the file system (FS) used supports blocks like HDFS, the block size
will be the maximum between the default block size of FS and this
property. And the row group size will be equal to this property.
block_size = max(default_fs_block_size, parquet.block.size)
row_group_size = parquet.block.size
If the file system used doesn't support blocks, then this property will define the row group size.
Note that larger values of row group size will improve the IO when
reading but consume more memory when writing.
Default value: 134217728 (128 MB)
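So the unit is bytes. A minimal pyspark sketch (the DataFrame df and the output path are just placeholders) that writes with a 256 MB row group would be:

    # parquet.block.size is given in bytes; 256 MB here.
    (df.write
       .format("parquet")
       .option("parquet.block.size", 256 * 1024 * 1024)
       .save("/tmp/example_output"))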
Unlike Parquet, Spark's own configuration settings are pretty well documented (the ones they want you to know about) on its website, as pointed out in another answer.
You have to go through the official documentation to find the properties.
For example, the Parquet-related properties are documented at
https://spark.apache.org/docs/2.4.0/sql-data-sources-parquet.html#configuration
while most of the general configuration properties are listed on Spark's configuration page (https://spark.apache.org/docs/2.4.0/configuration.html).
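For instance, one of the Parquet-related Spark SQL properties from that page can be set on the session like this (a minimal sketch; the session setup is only for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-config-example").getOrCreate()
    # Spark SQL Parquet property documented on the page linked above.
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")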
Hope it helps!

How to read a large number of large files on NFS and dump to HDFS

I am working with some legacy systems in the investment banking domain, which are very unfriendly in the sense that the only way to extract data from them is through a file export/import. Lots of trading takes place and a large number of transactions are stored on these systems.
The question is how to read a large number of large files on NFS and dump them onto a system where analytics can be done with something like Spark or Samza.
Back to the issue: due to the nature of the legacy systems, we are extracting data and dumping it into files. Each file is hundreds of gigabytes in size.
I feel the next step is to read these and dump them to Kafka or HDFS, or maybe even Cassandra or HBase, the reason being that I need to run some financial analytics on this data. I have two questions:
How to efficiently read a large number of large files which are located on one or several machines
Apparently you've discovered already that mainframes are good at writing large numbers of large files. They're good at reading them too. But that aside...
IBM has been pushing hard on Spark on z/OS recently. It's available for free, although if you want support, you have to pay for that. See: https://www-03.ibm.com/systems/z/os/zos/apache-spark.html My understanding is that z/OS can be a peer with other machines in a Spark cluster.
The z/OS Spark implementation comes with a piece that can read data directly from all sorts of mainframe sources: sequential, VSAM, DB2, etc. It might allow you to bypass the whole dump process and read the data directly from the source.
Apparently Hadoop is written in Java, so one would expect that it should be able to run on z/OS with little problem. However, watch out for ASCII vs. EBCDIC issues.
On the topic of using Hadoop with z/OS, there are a number of references out there, including a Redpaper: http://www.redbooks.ibm.com/redpapers/pdfs/redp5142.pdf
You'll note that in there they make mention of using the Co:Z toolkit, which I believe is available for free.
However you mention "unfriendly". I'm not sure if that means "I don't understand this environment as it doesn't look like anything I've used before" or it means "the people I'm working with don't want to help me". I'll assume something like the latter since the former is simply a learning opportunity. Unfortunately, you're probably going to have a tough time getting the unfriendly people to get anything new up and running on z/OS.
But in the end, it may be best to try to make friends with those unfriendly z/OS admins as they likely can make your life easier.
Finally, I'm not sure what analytics you're planning on doing with the data. But in some cases it may be easier/better to move the analytics process to the data instead of moving the data to the analytics.
The simplest way to do this is zconnector, an IBM product for data ingestion between the mainframe and a Hadoop cluster.
I managed to find an answer. The biggest bottleneck is that reading a file is essentially a serial operation; that is the most efficient way to read from a disk. So for one file I am stuck with a single thread reading it from NFS and sending it to HDFS or Kafka via their APIs.
So it appears the best way is to make sure that the source the data is coming from dumps files into multiple NFS folders. From that point onward I can run multiple processes to load data into HDFS or Kafka, since they are highly parallelized.
How to load? One good way is to mount the NFS into the Hadoop infrastructure and use distcp. Other possibilities open up once we make sure files are available from a large number of NFS mounts. Otherwise, remember that reading a file is a serial operation. Thanks.
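A minimal sketch of the "multiple processes" idea, assuming the NFS exports are already mounted locally and using the standard hdfs dfs -put command (the mount points and HDFS target directory below are just placeholders):

    import subprocess
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    # Placeholder locations: several NFS mount points and an HDFS landing directory.
    NFS_MOUNTS = ["/mnt/nfs1/exports", "/mnt/nfs2/exports", "/mnt/nfs3/exports"]
    HDFS_TARGET = "/data/landing"

    def load_file(path):
        # Each file is still read serially, but different files load in parallel.
        subprocess.run(["hdfs", "dfs", "-put", "-f", str(path), HDFS_TARGET], check=True)
        return path

    if __name__ == "__main__":
        files = [p for mount in NFS_MOUNTS for p in Path(mount).iterdir() if p.is_file()]
        with ProcessPoolExecutor(max_workers=len(NFS_MOUNTS)) as pool:
            for done in pool.map(load_file, files):
                print(f"loaded {done}")

The distcp approach mentioned above does the copying inside the cluster instead; this sketch is just the client-side equivalent.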

web development - deletion of user data?

I have finished my first complex web application, and I have found out it is probably better to use "isDeleted" flags in the DB than to hard-delete records. But I wonder what the recommended approach is for data stored on the filesystem (e.g. photos). Should I delete them when their related entity is (soft-)deleted, or keep them as they are? Can junk accumulation cause you to run out of storage in practice?
It definitely can - you'll need to gather some stats on how much data the typical account generates, and then figure out how many deletions you're seeing to sort out how much junk data will pile up and/or when you'll fill up your storage.
You might also want to try using something like S3 to store your data - at that point, the only reason you would need to delete things would be because it was costing you too much to store it.
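If you do decide to clean up the files eventually, a minimal sketch of a periodic purge job could look like the following (the photos table, its is_deleted/deleted_at columns, and the upload directory are assumptions for illustration, not a known schema):

    import os
    import sqlite3
    from datetime import datetime, timedelta

    UPLOAD_DIR = "/var/app/uploads"    # assumed location of stored photos
    GRACE_PERIOD = timedelta(days=30)  # keep files a while after the soft delete

    def purge_soft_deleted(db_path="app.db"):
        """Delete files belonging to records soft-deleted longer ago than GRACE_PERIOD."""
        cutoff = datetime.utcnow() - GRACE_PERIOD
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT id, photo_filename FROM photos "
            "WHERE is_deleted = 1 AND deleted_at < ?",
            (cutoff.isoformat(),),
        ).fetchall()
        for row_id, filename in rows:
            path = os.path.join(UPLOAD_DIR, filename)
            if os.path.exists(path):
                os.remove(path)
            # Optionally hard-delete the row once the file is gone.
            conn.execute("DELETE FROM photos WHERE id = ?", (row_id,))
        conn.commit()
        conn.close()

Running something like this on a schedule keeps the junk bounded without making deletion immediate.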

How do we get around the Lotus Notes 60 GB database barrier

Are there ways to get around the upper database size limit on Notes databases? We are compacting a database that is still approaching 60 gigs in size. Thank you very much if you can offer a suggestion.
Even if you could find a way to get over the 64GB limit it would not be the recommended solution. Splitting up the application into multiple databases is far better if you wish to improve performance and retain the stability of your Domino server. If you think you have to have everything in the same database in order to be able to search, please look up domain search and multi-database search in the Domino Administrator help.
Maybe some parts of the data is "old" and could be put into one or more archive databases instead?
Maybe you have a lot of large attachments and can store them in a series of attachment databases?
Maybe you have a lot of complicated views that can be streamlined or eliminated and thereby save a lot of space and keep everything in the same database for the time being? (Remove sorting on columns where not needed, using "click on column header to sort" is a sure way to increase the size of the view index.)
I'm assuming your database is large because of file attachments as well. In that case look into DAOS - it will store all file attachments on filesystem (server functionality - transparent to clients and existing applications).
As a bonus it finds duplicates and stores them only once.
More here: http://www.ibm.com/developerworks/lotus/library/domino-green/
Just a stab in the dark:
Use the DB2 storage method instead of standard NSF storage on the Domino server?
I'm guessing that 80-90% of that space is taken up by file attachments. My suggestion is to move all the attachments to a file share, provided everyone can access that share, or to an FTP server that everyone can connect to.
It's not ideal because security becomes an issue - now you need to manage credentials to the Notes database AND to the external file share - however it'll be worth the effort from a Notes administrator's perspective.
In the Notes documents, just provide a link to the file. If users are adding these files via a Notes form, perhaps you can add some background code to extract the file from the document after it has been saved, and replace it with a link to that file.
The 64 GB is not actually an absolute limit; you can go above that. I've seen 80 GB and even close to 100 GB, although once you're past 64 GB you can get problems at any time. The limit is not actually Notes, it's the underlying file system; I've seen this on AS/400. The great thing about Notes is that if you do get a huge crash, you can still access all the documents and pull everything out to new copies using scheduled agents, even if you can no longer get views to open in the client.
Your best bet is regular archiving. If it is file attachments, then anything over two years old doesn't need to be in the main system, just a brief synopsis and a link; you could even have a 5-year archive, 2-year archive, 1-year archive, etc. Data will continue to accumulate and has to be managed, irrespective of what platform you use to store it.
If the issue really is large file attachments, I would certainly recommend looking into implementing DAOS on your server / database. It is only available with Domino Server 8.5 and later. On the other hand, if your database contains over 100,000 documents, you may want to look seriously at dividing the data into multiple NSFs - at that number of documents, you need to be very careful about your view design, your lookup code, etc.
Some documented successes with DAOS:
http://www.edbrill.com/ebrill/edbrill.nsf/dx/yet-another-daos-success-story-from-darren-duke?opendocument&comments
If your database is getting to 60 GB, don't use a Domino solution; you need to switch to a relational database, or archive or move documents across several databases. Although you can get to 60 GB, you shouldn't do it: the performance hit for active databases is significant, though it is not so much a problem for static databases.
I would also look at removing any unnecessary views and their indexes. View indexes can occupy 80-90% of your disk space. If you can't remove them, simplify their sorting arrangements/formulas and remove any unnecessary column sorting options. I halved a 50 GB database down to 25 GB with a few simple changes like this and virtually no users noticed.
One path could be, for once, to start with the user. Do all the users need to access all that data all the time? If no, it's time to split or archive. If yes, there is probably a flaw in the design of the application.
Technically, I would add to the previous comments a suggestion to check the many options for compaction. Quick and dirty: discard all view indices, but be sure to rebuild at least the one for the default view if you don't want your users to riot. See the updall server task.
One more thing to check: make sure you have checked
[x] Use LZ1 compression for attachments
in db properties.
