What's wrong with my SORT function here? - mainframe

First off, I'm a complete beginner to anything Mainframe-related.
I have a training assignment at work to find matching keys in two files using SORT. I submitted this code to my mentor, pseudo-coded here because I can't access the system from home yet and didn't think to copy it before leaving:
//STEP01   EXEC PGM=SORT
//SORTIN   DD DSN=file1,DISP=SHR
//         DD DSN=file2,DISP=SHR
//SORTXSUM DD DSN=output file
//SORTOUT  DD SYSOUT=*        don't need this data anywhere specific, so just tossing it at spool
//SYSIN    DD *
  SORT FIELDS=(1,22,CH,A)
  SUM FIELDS=NONE,XSUM
/*
When I stick a couple of random sequential files in, the output is exactly what I expect it to be. However, my mentor says it doesn't work. His English is kinda bad and I rarely understand what he's saying the first few times he repeats it.
This combined with him mentioning JOINKEYS (before promptly leaving work, of course) makes me think he just wants (needs?) it done a different way and is doing a really poor job of expressing it.
Either way, could someone please tell me whether or not the code I wrote sucks and explain why it apparently falls short of a method using JOINKEYS?

Here's the requirement that your code would satisfy:
Take two unsorted datasets; match them on a 22-byte key; write all the data out across two files. Where a key is duplicated, keep one record of the matched group (whichever is convenient to the sort, with no guarantee that the same record would be picked in a subsequent run) and write it to one output file; write every record not written to that first file to the second file instead.
If that is the requirement, you are on to a winner, as it will perform better than the equivalent JOINKEYS.
The solution can also be modified in a few ways. With OPTION EQUALS, or EQUALS on the SORT statement, it will always be the first record of a group of equal keys that is retained.
For more flexibility over what is retained, DUPKEYS could be used instead of SUM.
If the requirement can be satisfied with SUM or DUPKEYS, it is more efficient to use them than to use JOINKEYS.
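For example (untested, just to show the shape of it), the EQUALS variant of your control statements would be:
  SORT FIELDS=(1,22,CH,A),EQUALS
  SUM FIELDS=NONE,XSUM
With EQUALS, the record kept on SORTOUT for each key is the first one read in (so the file1 record where both files have the key), and the eliminated duplicates still go to SORTXSUM.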
If the data is already in sequence, but the requirement is otherwise the same, then sorting it again is not a good way to do it. You can use a MERGE in place of the SORT, with SORTIN01 and SORTIN02 DD statements instead of your concatenated SORTIN.
If you had DFSORT instead of SyncSORT, you could use ICETOOL's SELECT operator to do all that XSUM and DUPKEYS can do (and more).
If you are doing something beyond what SUM and DUPKEYS can do, you'll need JOINKEYS.
For instance, if the data is already in sequence, you'd specify SORTED on the JOINKEYS for that input.
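To give a flavour, an untested JOINKEYS sketch with your dataset-name placeholders might look like the following. The 80-byte REFORMAT length is an assumption about your record layout, and the default join keeps only the matched records; add JOIN UNPAIRED operands, and SORTED on the JOINKEYS for any input already in sequence, to suit the real requirement.
//STEP01   EXEC PGM=SORT
//SORTJNF1 DD DSN=file1,DISP=SHR
//SORTJNF2 DD DSN=file2,DISP=SHR
//SORTOUT  DD DSN=output file
//SYSIN    DD *
  JOINKEYS FILES=F1,FIELDS=(1,22,A)
  JOINKEYS FILES=F2,FIELDS=(1,22,A)
  REFORMAT FIELDS=(F1:1,80)
  SORT FIELDS=COPY
/*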
On the Mainframe, resources are paid for by the client, so we aim to avoid profligacy. If one way uses fewer resources, we choose that.
Without knowing your exact requirement, I can't tell whether your solution is the best :-)

VariesAcrossGroups lost when ReInsert_ing doc.ParameterBindings?

Our plugin maintains some instance parameter values across many elements, including those in groups.
Occasionally the end users will introduce data that activates an unused Category,
so we have to update the document parameter bindings, to include those categories. However, when we call
doc.ParameterBindings.ReInsert()
our existing parameter values inside groups are lost, because our VariesAcrossGroups flag is toggled back to false.
How did Revit intend this to work - are we supposed to use this in a different way, to not trigger this problem?
ReInsert() expects a base Definition argument, and would usually be given an ExternalDefinition.
To learn more, I instead tried to scan through the definition keys of the existing bindings and match against those.
This way I got hold of the document's InternalDefinition, and tried calling ReInsert with that instead
(my hope was that, since the existing InternalDefinition DID have VariesAcrossGroups=true, this would help). Alas, ReInsert doesn't seem to care.
The problem, as you might guess, is that after VariesAcrossGroups=False, a lot of my instance parameters have collapsed into each other, so they all hold identical values. Given that they are IDs, this is less than ideal.
My current (intended) workaround is to grab a backup of all existing parameter values BEFORE I update the bindings, then, after the binding update and after setting VariesAcrossGroups back to true, inspect all the values and re-assign every parameter value that has been broken. But as you may surmise, this is less than ideal: it will make our plugin horribly slow for the users, and frankly it seems like something the Revit API should take care of, not the plugin developer.
Are we using this the wrong way?
One approach I have considered is to bind every possible category I can think of, up front and once only. But I'm not sure that is possible. Categories in themselves are also difficult to work with, as you can only create them indirectly, by using your project Document as a factory (i.e. you cannot create a category yourself; you can only indirectly ask the Document to - maybe! - create a category for you, that you request). Because of this, I don't think you can bind for all categories up front - some categories only become available in the document AFTER you have included a given family/type in your project.
To sum it up: First, I
doc.ParameterBindings.ReInsert()
my binding, with the updated categories. Then, I call
InternalDefinition.SetAllowVaryBetweenGroups()
(after having determined IDEF.VariesAcrossGroups has reverted back to false.)
I am interested to hear the best way to do this, without destroying the client's existing data.
Thank you very much in advance.
(I'm not sure I will accept my own answer).
My answer is just that you can work around this problem
by scanning the entire Revit database for your existing parameter values before you update the document bindings.
Afterwards, you reset VariesAcrossGroups back to its lost value.
Then you iterate through your collected parameters, verify which ones have lost their original value, and reset them back to their intended value.
One trick that speeds this up a bit is that you only need to check elements where Element.GroupId <> -1, that is, elements that are group members.
You only need to track group members, as it is precisely those that are affected by this Revit bug.
A further tip is that you should not only watch out for parameter values that have lost their original value; you must also watch out for parameter values that have accidentally GOTTEN a value, but which should be left unset.
I just use FilteredElementCollector with WhereElementIsNotElementType().
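To make that concrete, here is a rough C# sketch of the whole sequence. It is untested; the method name, the name-based lookup of the InternalDefinition, and the assumption that the parameter is already bound and stores a String are all mine, so adjust it to your add-in:
using System.Collections.Generic;
using Autodesk.Revit.DB;

public static class RebindHelper
{
    // Sketch: re-insert 'definition' with 'newBinding' (extra categories), then repair
    // VariesAcrossGroups and any group-member values the rebind clobbered.
    // Assumes the parameter is already bound and its storage type is String.
    public static void RebindPreservingGroupValues(Document doc, Definition definition, Binding newBinding)
    {
        InternalDefinition idef = FindInternalDefinition(doc, definition.Name);

        // 1. Back up current values, but only for group members - only they are affected.
        var backup = new Dictionary<ElementId, string>();
        foreach (Element e in new FilteredElementCollector(doc).WhereElementIsNotElementType())
        {
            if (e.GroupId == ElementId.InvalidElementId) continue;
            Parameter p = e.get_Parameter(idef);
            if (p != null) backup[e.Id] = p.HasValue ? p.AsString() : null;
        }

        using (var t = new Transaction(doc, "Rebind parameter"))
        {
            t.Start();

            // 2. Update the bindings with the extra categories.
            doc.ParameterBindings.ReInsert(definition, newBinding);
            doc.Regenerate();

            // 3. The rebind resets VariesAcrossGroups, so switch it back on.
            idef = FindInternalDefinition(doc, definition.Name);
            if (!idef.VariesAcrossGroups)
                idef.SetAllowVaryBetweenGroups(doc, true);

            // 4. Restore values that were lost, and blank values that appeared out of nowhere.
            foreach (var kv in backup)
            {
                Parameter p = doc.GetElement(kv.Key)?.get_Parameter(idef);
                if (p == null || p.IsReadOnly) continue;
                string now = p.HasValue ? p.AsString() : null;
                if (now != kv.Value) p.Set(kv.Value ?? string.Empty);
            }

            t.Commit();
        }
    }

    // Find the InternalDefinition behind an existing binding by parameter name.
    private static InternalDefinition FindInternalDefinition(Document doc, string name)
    {
        DefinitionBindingMapIterator it = doc.ParameterBindings.ForwardIterator();
        it.Reset();
        while (it.MoveNext())
            if (it.Key.Name == name) return it.Key as InternalDefinition;
        return null;
    }
}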
Performance-wise, it is of course horrible to do all this,
but given how Revit behaves, I see no other solution if you have to ship to your clients.

Can you please explain the difference between the BLKSET sort option and the NOBLKSET sort option?

Recently I came across an abend in a SORT step in a mainframe job, where SORTOUT is a VSAM file and SORTIN is a sequential file.
The error is:
ICE077A 0 VSAM OUTPUT ERROR L(12) SORTOUT
One of my senior colleagues suggested that I check whether there were any duplicates, but I didn't find any duplicates in the input file.
After going through some manuals, I found that an OPTION NOBLKSET control card overrides the default BLOCKSET copy technique, and can be used to bypass sort errors (provided all the possible effects of bypassing the sort error are analysed), so I used
OPTION NOBLKSET. Now the step executes successfully.
After analysing the SYSOUT I found that
ICE143I K PEERAGE SORT TECHNIQUE SELECTED
Can any one explain how the BLOCKSET technique works and how PEERAGE technique works?
SORT used in our system is DFSORT.
You can start here, which explains that of three techniques Blockset is DFSORT's preferred and most efficient technique for sorting, merging and copying datasets: http://pic.dhe.ibm.com/infocenter/zos/v1r12/index.jsp?topic=%2Fcom.ibm.zos.r12.icea100%2Fice1ca5028.htm
Peerage/Vale and Conventional are the other two techniques, of which one is selected which is thought to be the next best if it is not possible to use Blockset.
You have misread the references to the use of NOBLKSET. In cases where DFSORT hits what are effectively "internal" errors, and Blockset is being used, turning off Blockset will cause another sort technique to be chosen, which perhaps will get your step run and Production finished whilst the error in the step that used Blockset is investigated.
NOBLKSET is not a cure-all, and it has not cured whatever is actually wrong with your use of DFSORT here. You should only use NOBLKSET in very limited circumstances which are suggested to you for very particular reasons. Blockset is significantly more efficient than Peerage/Vale or Conventional.
You should update your question with a sample of the input data and an IDCAMS LISTCAT of the KSDS.
You either had a duplicate key, or the insertions (your file being written) were not in sequence. Remember you can get duplicates if you have a KSDS with data on it already.
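If it helps, a minimal LISTCAT step looks something like this (the cluster name is just a placeholder):
//LISTC    EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  LISTCAT ENTRIES(YOUR.KSDS.CLUSTER) ALL
/*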
If you want details about Blockset and Peerage/Vale, you'll have to hit technical journals and possibly patent listings. I don't know why you'd want to go that far. Perhaps knowing that, you now don't?

Match snippets in an HTML document

Somebody has presented me with a very large list of copyedits to make to a long HTML document. The edits are in the format:
"religious" should be "religions"
"their" should be "there"
"you must persistent" should be "you must be persistent"
The copyedits were typed by hand; in some cases, the "actual" value on the left is not an exact match for the content in the document. The order of edits is usually correct, but even that is not guaranteed.
It's a straightforward but very large task to apply these edits by hand to the document. I'd like to automate the process as much as possible, e.g. by automatically searching for the snippets.
In a long document like this, I can't just search for all instances of "their" and replace them with "there." Sometimes "their" was used correctly, just not in one particular instance.
In other words, I'm looking for a fuzzy text match, where the order of the edits influences the search.
What's a good approach to a problem like this? I'm hoping that there's some off-the-shelf open source project that can search for the snippets in a fuzzy order.
I am not aware of any such tool, but I would use edit distance for both parts:
for the non-exact string match: probably standard Levenshtein plus swaps (i.e. the Damerau-Levenshtein distance)
for the non-exact sequence match: this time probably with only Match and Swap operations. You can use a free (zero-cost) Insert to account for the words that should not be edited.
It should not be hard to implement, but the computational complexity will be quite high. I would use some heuristics to skip hopeless matches. Preprocessing the words in the document and in the edit list could help: get the set of characters for each word to allow a quick comparison before calculating the full edit distance, etc.
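To illustrate the first part, here is a small Java sketch of the restricted Damerau-Levenshtein distance (optimal string alignment: Levenshtein plus adjacent swaps). It is only a starting point; the sequence-level alignment with zero-cost Inserts would be built along the same dynamic-programming lines.
public final class EditDistance {

    // Restricted Damerau-Levenshtein (optimal string alignment) distance:
    // unit cost for insert, delete, substitute and adjacent transposition.
    public static int osaDistance(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // deletion
                        d[i][j - 1] + 1),         // insertion
                        d[i - 1][j - 1] + cost);  // substitution (or match)
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // adjacent transposition
                }
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(osaDistance("religious", "religions")); // prints 1 (one substitution)
    }
}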

How do I create an array of resources using Jena?

I am using Jena and Java, and am reading a CSV file. For each line of the file there is a subject resource. Two subject resources, on adjacent lines, might share the same value of a field in the line (e.g. both lines have the same process id). In this case, I need to combine the two subject resources, as each one represents a sub-process in production (for example).
My question is: how can I reference those two resources dynamically so that I can combine them? My idea was that, when I find that they share the same property, I would store them in an array of subject resources. Is that the right approach?
This question would be a lot easier to answer if you could show some sample data. As it is, I think you're focusing on the wrong bit of the question. If you can decide clearly what it means to have two rows in your CSV with identical process, and then you decide how you're going to encode that meaning in your RDF model, then the question of how to write the code - as an array or whatever - will be much clearer.
For example (and I'm going to make up some data here - as I said, it would be easier if you show an actual example), suppose your CSV contains:
processId,startTime,endTime
123,15:22:00,15:23:00
123,16:22:00,16:25:00
So process 123 has, apparently, two start and end time pairs. If you model this naively in RDF, you'll end up with a confusing model:
process:process123
    a :Process;
    process:start "15:22:00"^^xsd:time;
    process:end "15:23:00"^^xsd:time;
    process:start "16:22:00"^^xsd:time;
    process:end "16:25:00"^^xsd:time;
    .
which would suggest that one process had two start times (and two end times) which looks nonsensical. However, it might be that in reality you have a single process with multiple episodes, suggesting one way to model it, or a periodic process which occurs at different times, or, as you suggested, sub-processes of a parent process. Or something else entirely (I'm only guessing, I don't know your domain). Once you are clear what the data means, you can produce a suitable RDF model. For example, an episodic process might be:
process:process123
    a :Process;
    process:episode [
        a process:Episode;
        process:start "15:22:00"^^xsd:time;
        process:end "15:23:00"^^xsd:time;
    ];
    process:episode [
        a process:Episode;
        process:start "16:22:00"^^xsd:time;
        process:end "16:25:00"^^xsd:time;
    ]
    .
Once the modelling is clear in your mind, I think you can see that the question of how to produce the desired RDF triples from Java code - and whether or not you need an array - is much clearer. Equally importantly, you can think in terms of the JUnit tests you would write to check that your code is behaving correctly.
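For instance, if the episodic reading turns out to be the right one, a rough Jena sketch (Jena 3 package names; the namespace, property names and inline CSV rows are made up for illustration) could keep a map from process id to its Resource, so rows that share an id re-use the same resource rather than needing an array:
import java.util.HashMap;
import java.util.Map;
import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class ProcessCsvToRdf {

    static final String NS = "http://example.com/process#";   // made-up namespace

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("process", NS);

        Property episode = model.createProperty(NS, "episode");
        Property start = model.createProperty(NS, "start");
        Property end = model.createProperty(NS, "end");
        Resource processType = model.createResource(NS + "Process");
        Resource episodeType = model.createResource(NS + "Episode");

        // One Resource per process id, shared by every CSV row with that id.
        Map<String, Resource> processes = new HashMap<>();

        String[][] rows = {                       // stand-in for the parsed CSV
            {"123", "15:22:00", "15:23:00"},
            {"123", "16:22:00", "16:25:00"}
        };

        for (String[] row : rows) {
            Resource proc = processes.computeIfAbsent(row[0],
                    id -> model.createResource(NS + "process" + id)
                               .addProperty(RDF.type, processType));

            Resource ep = model.createResource()              // blank node for this episode
                    .addProperty(RDF.type, episodeType)
                    .addProperty(start, model.createTypedLiteral(row[1], XSDDatatype.XSDtime))
                    .addProperty(end, model.createTypedLiteral(row[2], XSDDatatype.XSDtime));

            proc.addProperty(episode, ep);
        }

        model.write(System.out, "TURTLE");
    }
}
Keying the map on the process id is what lets any number of rows contribute episodes to the same process resource; no explicit array of resources is needed.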

matching common strings between two data sets

I am working on a website conversion. I have a dump of the database backend as an sql file. I also have a scrape of the website from wget.
What I'm wanting to do is map database tables and columns to directories, pages, and sections of pages in the scrape. I'd like to automate this.
Is there some tool or a script out there that could pull strings from one source and look for them in the other? Ideally, it would return a set of results that would say something like
string "piece of website content here" on line 453 in table.sql matches string in website.com/subdirectory/certain_page.asp on line 56.
I don't want to do line comparisons because lines from the database dump (INSERT INTO table VALUES (...) ) aren't going to match lines in the page where it actually populates (<div id='left_column'><div id='left_content'>...</div></div>).
I realize this is a computationally intensive task, but even letting it run over the weekend is fine.
I've found similar questions, but I don't have enough CS background to know if they are identical to my problem or not. SO kindly suggested this question, but it appears to be dealing with a known set of needles to match against the haystack. In my case, I need to compare haystack to haystack, and see matching straws of hay.
Is there a command-line script or command out there, or is this something I need to build? If I build it, should I use the Aho–Corasick algorithm, as suggested in the other question?
So your two questions are 1) Is there already a solution that will do what you want, and 2) Should you use the Aho-Corasick algorithm.
The first answer is that I doubt you'll find a ready-built tool that will meet your needs.
The second answer is that, since you don't care about performance and have a limited CS background, you should use whatever algorithm you find simplest to implement.
I will go one step further and propose an architecture.
First, you need to be able to parse the .sql files in a meaningful way, with a parser that goes line by line and returns tablename, column_name, and value. A StreamReader is probably best for this.
Second, you need a parser for your webpages that will go element by element and return each text node and the name of each parent element, all the way up to the html element and its parent filename. An XmlTextReader or similar streaming XML parser, such as SAXON, is probably best, as long as it will operate on non-valid XML.
You would need to tie these two parsers together with a mutual search algorithm of some sort. You will have to customize it to suit your needs. Aho-Corasick will apparently get you the best performance if you can pull it off. A naive algorithm is easy to implement, though, and here's how:
Assuming you have your two parsers that loop through each field (on the one hand) and each text node (on the other hand), pick one of the two parsers and have it go through each of the strings in its data source, calling the other parser to search the other data source for all possible matches, and logging the ones it finds.
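A rough Java sketch of that naive approach (untested; the file names, the crude regular expressions for pulling string literals out of the INSERTs and stripping tags from the pages, and the 20-character minimum are all assumptions you would tune, and reporting the matching line inside the page is left out):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class DumpVsScrape {

    // Quoted string literals inside the INSERT lines (very rough; ignores escaped quotes).
    private static final Pattern SQL_STRING = Pattern.compile("'([^']{20,})'");

    public static void main(String[] args) throws IOException {
        List<String> sqlLines = Files.readAllLines(Paths.get("table.sql"), StandardCharsets.UTF_8);

        // Load each scraped page once, with tags crudely stripped so we compare against the page text.
        List<Path> pages;
        try (var paths = Files.walk(Paths.get("scrape"))) {
            pages = paths.filter(f -> f.toString().endsWith(".asp")).collect(Collectors.toList());
        }
        List<String> pageText = new ArrayList<>();
        for (Path p : pages) {
            pageText.add(Files.readString(p, StandardCharsets.UTF_8).replaceAll("<[^>]+>", " "));
        }

        // Naive search: every long-enough database string against every page.
        for (int line = 0; line < sqlLines.size(); line++) {
            Matcher m = SQL_STRING.matcher(sqlLines.get(line));
            while (m.find()) {
                String needle = m.group(1);
                for (int i = 0; i < pages.size(); i++) {
                    if (pageText.get(i).contains(needle)) {
                        System.out.printf("string \"%s\" on line %d in table.sql matches %s%n",
                                needle, line + 1, pages.get(i));
                    }
                }
            }
        }
    }
}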
This kind of matching cannot work, at least not reliably. In the best case you would fit every piece of data to its counterpart in your HTML files, but you would get many false positives - for example, user names that are actual words, etc.
Furthermore, text is often manipulated before it is displayed. Sites often capitalize titles or truncate texts for previews, etc.
AFAIK there is no such tool, and in my opinion there cannot exist one that solves your problem adequately.
Your best choice is to get the source code the site uses/used and analyze it. If that fails or is not possible, you have to analyze the database manually. Get as much content as possible from the URL and try to piece the puzzle together.
