Reading from a binary file with a given data structure - python-3.x

I want to read the 'Last Traded Price' from the given binary file. How do I extract a specific data out of the file by using notations like 'hhl10s6sc'. I know I have to use the struct.unpack method, but where can I learn to write such formatting (with some illustrations) so that I can extract any data that I want from such a binary file.
The thing that is troubling me is the unpacking that the writer of the code (that I'm trying to understand) has written - 'hlhcl6s10s11s10s2s1s10s12schc' . I understood what 6s...12s mean, but what's the significance of the 'hlhcl' (5 characters in the beginning) and 'chc' (3 characters in the last). The writer has tried to retrieve the 'Last traded price' from the data structure.
If you could give some examples and/or some sources for the same, it'd be very helpful. Attached the image which contains the data structure of the given file.
This image shows the data structure

struct format strings are fields described in order. Every letter is a format character, so hlhcl translates to "short, long, short, character, long". This doesn't resemble the image you linked (which is a tad impractical as it's off-site and another step to look up), which starts with a single long and otherwise holds only strings. It might apply to a protocol wrapping that packet.

Related

How can you dynamically format a string with a user-provided template and slice of parameters in Go?

I have user-provided format strings, and for each, I have a corresponding slice. For instance, I might have Test string {{1}}: {{2}} and ["number 1", "The Bit Afterwards"]. I want to generate Test string number 1: The Bit Afterwards from this.
The format of the user-provided strings is not fixed, and can be changed if need be. However, I cannot guarantee their sanity or safety; neither can I guarantee that any given character will not be used in the string, so any tags (like {} in my example) must be escapable. I also cannot guarantee that the same number of slice values will exist as tags in the template - for example, I might quite reasonably have Test string {{1}} and ["number 1", "another parameter", "yet another parameter"].
How can I efficiently format these strings, in accordance with the input given? They are for use as strings only, and don't require HTML, SQL or any other sort of escaping.
Things I've already considered:
fmt.Sprintf - two issues: 1) using it with user-provided templates is not ideal; 2) Sprintf does not play nicely with a number of parameters that doesn't match its format string, adding %!(EXTRA type=value) to the end.
The text/template library. This would work fine in theory, but I don't want to have to make users type out {{index .arr n}} for each and every one of their tags; in this case, I only ever need slice indexes.
The valyala/fasttemplate library. This is pretty much exactly what I'm looking for, but for the fact that it doesn't currently support escaping the delimiters it uses for its tags, at the time of writing. I've opened an issue for this, but I would have thought that there's already a solution to this problem somewhere - it doesn't feel like it's that unique.
Just writing my own parser for it. This would work... but, as above, I can't be the first person to have come across this!
Any advice or suggestions would be greatly appreciated.

PL/I & not printing on tab stop

I have the following PL/I code:
declare 1 u union,
2 c character(1),
2 ci fixed binary(4) unsigned;
ci = data_mem(data_ptr);
put list (c);
What this does, is it takes an integer and outputs that as if it was an ascii/ebcdic value. So it shows characters. Sofar this works.
The problem now is that each character is printed at an 24 spaces interval, as if 3 TABs are inserted.
I tried converting c to a string first and then applying trim() but that did not help.
Any ideas?
This is the default PUT LIST behavior for a PRINT-attribute file. From the IBM Enterprise PL/I for z/OS Language Reference, under Stream-oriented Data Transmission -> LIST -> PUT list-directed (emphasis mine):
The values of the data-list items are converted to character representations (except for graphics) and transmitted to the data stream. A blank separates successive data values transmitted. For PRINT files, items are separated according to program tab settings (see “PRINT attribute”).
The next manual section discusses the PRINT attribute. Here we have
Data values transmitted by list- and data-directed data transmission are
automatically aligned on the left margin and on implementation-defined preset tab
positions.
Since you omitted FILE, your PUT is going to the default FILE(SYSPRINT). SYSPRINT is defined implicitly as FILE ENVIRONMENT(F RECSIZE(121)) OUTPUT PRINT STREAM (see Input and Output -> FILE Attribute -> File constant in the Language Reference, and Defining and Using Consecutive Data Sets -> Using PRINT files with stream I/O in the Programmer's Guide). IIRC, the defaults are every 24, which gives 5 tabs per line, compatible with the old 120-byte printers common in the early days of PL/I F in the late 1960s. This is modifiable by declaring a PLITABS structure (described in the previously mentioned manual section).
LIST- and DATA-directed I/O are intended to be quick & dirty I/O interfaces with little regard for formatting on output (but are very forgiving on input). EDIT is better for formatting output, but it does show a lot of its FORTRAN roots for input and output. Personally, for traditional reports using formatted output, and for record input, I would work with record I/O, which is analogous to standard COBOL I/O.

SUM not working 'Invalid or missing field format'

I have an input file in this format: (length 20, 10 chars and 10 numerics)
jname1 0000500006
bname1 0000100002
wname1 0000400007
yname1 0000000006
jname1 0000100001
mname1 0000500012
mname2 0000700013
In my jcl I have defined my sysin data as such:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,FD)
DATAEND
*
It works fine as long as I don't add the sum fields so I'm wondering if I'm using the wrong format for my numerics seeing as I know they start at field 11 and have a length of 10 the format is the only thing that could be wrong.
As you might have already realised the point of this JCL is to just list the values but grouped by the first letter of the name (so for the example data and JCL I have given it would group the numeric for mname1 and mname2 together but leave the other records untouched).
I'm kind of new at this so I was wonder what I need for the format if my numerics are like that in the input file.
If new to DFSORT, get hold of the DFSORT Getting Started guide for your version of DFSORT (http://www-01.ibm.com/support/docview.wss?uid=isg3T7000080).
This takes your through all the basic operations with many examples.
The DFSORT Application Programming Guide describes everything you need to know, in detail. Again with examples. Appendix C of that document contains all the data-types available (note, when you tried to use FD, FD is not valid data-type, so probably a typo). There are Tables throughout the document listing what data-types are available where, if there is a particular limit.
For advanced techniques, consult the DFSORT Smart Tricks publication here: http://www-01.ibm.com/support/docview.wss?uid=isg3T7000094
You need to understand a bit more the way data is stored on a Mainframe as well.
Decimals (which can be "packed-decimal" or "zoned-decimal") do not contain a decimal-point. The decimal-point is implied. In high-level languages you tell the compiler where the decimal-point is (in a fixed position) and the compiler does the alignments for you. In Assembler, you do everything yourself.
Decimals are 100% accurate, as there are machine-instructions which act directly on packed-decimal data giving packed-decimal results.
A field which actually contains a decimal-point, cannot be directly used in arithmetic.
An unsigned field is treated as positive when used in any arithmetic.
The SUM statement supports a limited number of numeric definitions, and you have chosen the correct one. It does not matter that your data is unsigned.
If the format of the output from SUM is not what you want, look at OPTION ZDPRINT (or NOZDPRINT).
If you want further formatting, you can use OUTREC or OUTFIL.
As an option to using SUM, you can use OUTFIL reporting functions (especially, although not limited to, if you want a report). You can use SECTIONS and TRAILER3 with TOT/TOTAL.
Something to watch for with SUM (which is not a problem with the reporting features) is if any given one (or more) of your SUMmed fields exceed the field size. To continue to use SUM if that happens, you need to extend the field in INREC and then get SUM to use the new, sufficient, size.
After some trial and error I finally found it, appearantly the format I needed to use was the ZD format (zoned decimal, signed), so my sysin becomes this:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,ZD)
DATAEND
*
even though my records don't contain any decimals and they are unsigned, I don't really get it so if someone knows why it's like that please go ahead and explain it to me.
For now the way I'm going to remember it is this: Z = symbol for real (meaning integers so no decimals)

how to use zlib.compressobj(..., zdict)

i'm trying to preset zlib's dictionary for compression. as of python 3.3 zlib.compressobj function offers the option. the docs say it should be some bytesarray or a bytes object e.g. b"often-found".
now: how to pass multiple strings ordered ascending by their likeliness to occur as suggested in the docs? is there a secret delimiter e.g. b"likely,more-likely,most-likely"?
No, there is no delimiter needed. All the dictionary is is a resource in which to look for strings that match portions of the data to be compressed. Therefore strings that are likely to occur can simply be concatenated. Or even overlapped if starts and ends match. For example if you want the words lighthouse and household to be available, you can just put lighthousehold in the dictionary.
Since it takes more bits to represent matches that are further back, you would put the most likely matches at the end of the dictionary.

What defines data that can be stored in strings

A few days ago, I asked why its not possible to store binary data, such as a jpg file into a string variable.
Most of the answers I got said that string is used for textual information such as what I'm writing now.
What is considered textual data though? Bytes of a certain nature represent a jpg file and those bytes could be represented by character byte values...I think. So when we say strings are for textual information, is there some sort of range or list of characters that aren't stored?
Sorry if the question sounds silly. Just trying to 'get it'
I see three major problems with storing binary data in strings:
Most systems assume a certain encoding within string variables - e.g. if it's a UTF-8, UTF-16 or ASCII string. New line characters may also be translated depending on your system.
You should watch out for restrictions on the size of strings.
If you use C style strings, every null character in your data will terminate the string and any string operations performed will only work on the bytes up to the first null.
Perhaps the most important: it's confusing - other developers don't expect to find random binary data in string variables. And a lot of code which works on strings might also get really confused when encountering binary data :)
I would prefer to store binary data as binary, you would only think of converting it to text when there's no other choice since when you convert it to a textual representation it does waste some bytes (not much, but it still counts), that's how they put attachments in email.
Base64 is a good textual representation of binary files.
I think you are referring to binary to text encoding issue. (translate a jpg into a string would require that sort of pre-processing)
Indeed, in that article, some characters are mentioned as not always supported, other can be confusing:
Some systems have a more limited character set they can handle; not only are they not 8-bit clean, some can't even handle every printable ASCII character.
Others have limits on the number of characters that may appear between line breaks.
Still others add headers or trailers to the text.
And a few poorly-regarded but still-used protocols use in-band signaling, causing confusion if specific patterns appear in the message. The best-known is the string "From " (including trailing space) at the beginning of a line used to separate mail messages in the mbox file format.
Whoever told you you can't put 'binary' data into a string was wrong. A string simply represents an array of bytes that you most likely plan on using for textual data... but there is nothing stopping you from putting any data in there you want.
I do have to be careful though, because I don't know what language you are using... and in some languages \0 ends the string.
In C#, you can put any data into a string... example:
byte[] myJpegByteArray = GetBytesFromSomeImage();
string myString = Encoding.ASCII.GetString(myJpegByteArray);
Before internationalization, it didn't make much difference. ASCII characters are all bytes, so strings, character arrays and byte arrays ended up having the same implementation.
These days, though, strings are a lot more complicated, in order to deal with thousands of foreign language characters and the linguistic rules that go with them.
Sure, if you look deep enough, everything is just bits and bytes, but there's a world of difference in how the computer interprets them. The rules for "text" make things look right when it's displayed to a human, but the computer is free to monkey with the internal representation. For example,
In Unicode, there are many encoding systems. Changing between them makes every byte different.
Some languages have multiple characters that are linguistically equivalent. These could switch back and forth when you least expect it.
There are different ways to end a line of text. Unintended translations between CRLF and LF will break a binary file.
Deep down everything is just bytes.
Things like strings and pictures are defined by rules about how to order bytes.
strings for example end in a byte with value 32 (or something else)
jpg's don't
Depends on the language. For example in Python string types (str) are really byte arrays, so they can indeed be used for binary data.
In C the NULL byte is used for string termination, so a sting cannot be used for arbitrary binary data, since binary data could contain null bytes.
In C# a string is an array of chars, and since a char is basically an alias for 16bit int, you can probably get away with storing arbitrary binary data in a string. You might get errors when you try to display the string (because some values might not actually correspond to a legal unicode character), and some operations like case conversions will probably fail in strange ways.
In short it might be possible in some langauges to store arbitrary binary data in strings, but they are not designed for this use, and you may run into all kinds of unforseen trouble. Most languages have a byte-array type for storing arbitrary binary data.
I agree with Jacobus' answer:
In the end all data structures are made up of bytes. (Well, if you go even deeper: of bits). With some abstraction, you could say that a string or a byte array are conventions for programmers, on how to access them.
In this regard, the string is an abstraction for data interpreted as a text. Text was invented for communication among humans, computers or programs do not communicate very well using text. SQL is textual, but is an interface for humans to tell a database what to do.
So in general, textual data, and therefore strings, are primarily for human to human, or human to machine interaction (say for the content of a message box). Using them for something else (e.g. reading or writing binary image data) is possible, but carries lots of risk bacause you are using the data type for something it was not designed to handle. This makes it much more error prone. You may be able to store binary data in strings, mbut just because you are able to shoot yourself in the foot, you should avoid doing so.
Summary: You can do it. But you better don't.
Your original question (c# - What is string really good for?) made very little sense. So the answers didn't make sense, either.
Your original question said "For some reason though, when I write this string out to a file, it doesn't open." Which doesn't really mean much.
Your original question was incomplete, and the answers were misleading and confusing. You CAN store anything in a String. Period. The "strings are for text" answers were there because you didn't provide enough information in your question to determine what's going wrong with your particular bit of C# code.
You didn't provide a code snippet or an error message. That's why it's hard to 'get it' -- you're not providing enough details for us to know what you don't get.

Resources